Pith · machine review for the scientific record

arxiv: 2605.10407 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Identified-Set Geometry of Distributional Model Extraction under Top-K Censored API Access

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords: top-K censoring · identified set · distribution recovery · logit API access · model extraction · KL divergence bounds · capability transfer · LLM distillation

The pith

Top-K censored logit APIs leave an identified set of teacher distributions whose total-variation diameter is exactly U_K = (V-K)exp(τ) / (Z_A + (V-K)exp(τ)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern LLM APIs return only the top-K logit scores and censor everything else. The paper determines the exact recovery limits under this access model by characterizing all distributions consistent with the censored outputs as an identified set. It proves that the set's diameter in total variation is given by the closed-form U_K, which depends on the censoring threshold τ, the number of censored tokens V-K, and the observed partition function Z_A. The work also supplies computable bounds on the KL divergence to the true distribution and shows that per-position recovery remains limited while capability transfer through distillation or generation can still be substantial.
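As a concreteness check, the closed form is computable from a single API response. A minimal sketch (function name and interface are ours, not the paper's), assuming the K revealed logits and the threshold τ are known, as the access model specifies:

    import numpy as np

    def tv_diameter(observed_logits, vocab_size, tau):
        """Closed-form U_K = (V-K)exp(tau) / (Z_A + (V-K)exp(tau)).

        observed_logits: the K revealed (uncensored) logits
        vocab_size:      total vocabulary size V
        tau:             censoring threshold (every hidden logit is <= tau)
        """
        K = len(observed_logits)
        Z_A = np.exp(np.asarray(observed_logits, dtype=float)).sum()  # observed partition function
        censored_mass = (vocab_size - K) * np.exp(tau)                # (V-K) exp(tau)
        return censored_mass / (Z_A + censored_mass)

Because the expression is monotone in τ, a provider could also invert it to pick τ for a target leakage level, which is the defense reading offered further below.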

Core claim

For censoring threshold τ, the compatible teacher distributions form an identified set whose total-variation diameter is exactly U_K=(V-K)exp(τ)/(Z_A+(V-K)exp(τ)), where Z_A is the observed partition function. For KL recovery, a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound are given, with an extension to reference-aware attackers. Experiments reveal that top-K censoring limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer.

What carries the argument

The identified set of teacher distributions compatible with observed top-K censored logits, whose spread is measured by the exact total-variation diameter U_K.

If this is right

  • The diameter U_K is strictly positive whenever τ is finite and V exceeds K, so perfect per-position recovery is impossible.
  • KL divergence to the true distribution can be bounded from below by a binary-endpoint calculation on the observed outputs.
  • An asymptotically tight upper bound on KL recovery holds when the ambiguity U_K is small.
  • Reference-aware attackers obtain improved bounds by using additional known information about the teacher.
  • Generation-based extraction can recover nearly all capability even when the recovered distribution matches the teacher only partially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • API providers could use the U_K formula to set τ so that leakage stays below a chosen tolerance while preserving utility.
  • The separation of distribution fidelity from capability transfer suggests that defenses focused only on logit precision may miss generation-based extraction routes.
  • The geometric view of the identified set may generalize to other partial-observation regimes such as temperature-scaled or top-p censored outputs.
  • Synthetic validation with known ground-truth teachers would directly test whether observed diameters match the predicted U_K.

Load-bearing premise

The censoring is a deterministic top-K selection at a fixed known threshold τ with no added noise, post-processing, or side-channel information.
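The premise pins down a very specific observation channel. A minimal simulation of it (our names; the paper treats τ as fixed and known, whereas here, purely for illustration, τ is read off as the largest censored logit, assuming distinct logits):

    import numpy as np

    def topk_censor(full_logits, K):
        """Deterministic top-K censoring: reveal the K largest logits,
        hide the rest. No noise, no post-processing, no side channel."""
        order = np.argsort(full_logits)[::-1]     # token indices, largest logit first
        revealed = order[:K]
        tau = full_logits[order[K]]               # largest censored logit
        return revealed, full_logits[revealed], tau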

What would settle it

Finding two distributions that produce identical top-K censored outputs yet differ in total variation by more than the predicted U_K for their τ and Z_A would falsify the diameter formula; so would showing that no compatible pair attains a distance arbitrarily close to U_K.
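The synthetic validation suggested above is easy to operationalize. A sketch of that test (our construction, not the paper's code): draw a random teacher, censor it, build the two extremal completions described in the rebuttal below, and compare their total-variation distance to the predicted U_K.

    import numpy as np

    def check_diameter(V=50, K=5, seed=0):
        rng = np.random.default_rng(seed)
        logits = rng.normal(size=V)                    # synthetic teacher logits
        order = np.argsort(logits)[::-1]
        idx = order[:K]                                # revealed tokens
        tau = logits[order[K]]                         # largest censored logit
        obs = logits[idx]
        Z_A = np.exp(obs).sum()
        M = (V - K) * np.exp(tau)                      # maximal censored (unnormalized) mass
        U_K = M / (Z_A + M)
        p_lo = np.zeros(V)                             # completion: censored logits -> -inf
        p_lo[idx] = np.exp(obs) / Z_A
        p_hi = np.full(V, np.exp(tau) / (Z_A + M))     # completion: censored logits -> tau
        p_hi[idx] = np.exp(obs) / (Z_A + M)
        tv = 0.5 * np.abs(p_lo - p_hi).sum()
        assert abs(tv - U_K) < 1e-10                   # the pair attains the predicted diameter
        return U_K

Any compatible pair found to exceed U_K, here or on real logits, would be the falsifier described above.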

Figures

Figures reproduced from arXiv: 2605.10407 by Binhan Luo, Haoran Zheng, Jianan Wu, Jyh-Shing Roger Jang, Wenhua Nie, Zicheng Zhu.

Figure 1
Figure 1. TV diameter U_K and binary-endpoint KL lower bound R_bin as a function of K on a Qwen3-0.6B math teacher (shading: ±1 population s.d. over 3,200 prompts). Even at K=100, U_K > 0.80 and R_bin > 0.43: the vast majority of the distribution remains unresolved. […] decreases monotonically from 3.15 (K=1) to 0.69 (K=100), paralleling the U_K curve. This is a training-averaged diagnostic: each SGD step observes many prompt…
Figure 2
Figure 2. KL closure vs. PVR across extraction methods, including on-task logit controls.
Original abstract

Modern LLM APIs often reveal only top-$K$ logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold $\tau$, the compatible teacher distributions form an identified set whose total-variation diameter is exactly $U_K=(V-K)\exp(\tau)/(Z_A+(V-K)\exp(\tau))$, where $Z_A$ is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-$K$ distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-$K$ censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies the per-position recovery limits for teacher output distributions when an LLM API returns only the top-K logits and censors the remainder at a fixed threshold τ. It claims that the set of compatible teacher distributions forms an identified set whose total-variation diameter is exactly U_K = (V-K)exp(τ) / (Z_A + (V-K)exp(τ)), where Z_A is the observed partition function over the revealed tokens. It further supplies a computable binary-endpoint lower bound and an asymptotically tight upper bound on KL recovery, plus an extension to reference-aware attackers, and reports layered extraction results on a Qwen3 math-reasoning model (12% on-task top-K distillation, 56% full-logit, 96% generation-based).

Significance. If the central geometric claim holds, the work supplies a parameter-free, exact characterization of distributional ambiguity under deterministic top-K censoring that cleanly separates per-token fidelity limits from downstream capability transfer. The direct derivation of U_K from the definition of the identified set (without auxiliary fitting) and the concrete experimental hierarchy on Qwen3 are strengths that could inform both theoretical analyses of API leakage and practical distillation protocols.

major comments (2)
  1. [§3 / main theorem on U_K] The derivation of the exact TV diameter U_K (stated in the abstract and presumably proved in §3) rests on the attainability of the two extremal completions (all censored logits → -∞ versus all set to τ). Please confirm in the proof that no additional ordering or normalization constraint implicit in the top-K API model can shrink the identified set below this diameter; if such a constraint exists it would be load-bearing for the exactness claim.
  2. [KL-recovery section] The KL-recovery bounds are described as 'computable binary-endpoint lower bound' and 'asymptotically matching small-ambiguity upper bound.' The manuscript should explicitly state the precise optimization problem solved for the lower bound and the regime in which the upper bound becomes tight (e.g., as the number of censored tokens or the gap to τ grows).
minor comments (2)
  1. [Notation / §2] Notation for Z_A (observed partition function) and V (vocabulary size) should be introduced once in the main text with a clear reminder that Z_A is computed only over the revealed top-K tokens.
  2. [Experiments] The experimental protocol for the three recovery percentages (12%, 56%, 96%) on Qwen3 would benefit from a short appendix table listing the exact prompts, evaluation metric, and number of positions used, to make the layered hierarchy reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the encouraging recommendation for minor revision. The comments help clarify the proof and the KL section. We address each major comment below.

Point-by-point responses
  1. Referee: [§3 / main theorem on U_K] The derivation of the exact TV diameter U_K (stated in the abstract and presumably proved in §3) rests on the attainability of the two extremal completions (all censored logits → -∞ versus all set to τ). Please confirm in the proof that no additional ordering or normalization constraint implicit in the top-K API model can shrink the identified set below this diameter; if such a constraint exists it would be load-bearing for the exactness claim.

    Authors: The proof in §3 constructs the identified set directly from the top-K API access model: the observed logits for the top-K tokens are fixed, and each censored token's logit is constrained to be at most τ while ensuring the top-K selection is preserved (i.e., observed logits exceed τ). The two extremal completions—setting all censored logits to -∞ (zero probability mass) and setting them all to τ (maximum allowable mass)—are both attainable without violating the model, as there are no further ordering constraints among the censored tokens beyond the per-token upper bound of τ. We will add an explicit sentence in the proof to confirm that these boundaries are feasible and that the diameter is not reduced by implicit normalization or ordering requirements. revision: yes
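The constraints the authors enumerate translate directly into a membership test for the identified set. A sketch of that test (hypothetical helper, our names), under which both extremal completions indeed pass, the -∞ completion strictly and the τ completion at the boundary:

    import numpy as np

    def in_identified_set(candidate_logits, revealed_idx, observed_logits, tau, atol=1e-8):
        """True iff a full logit vector is consistent with the censored view:
        revealed logits reproduced exactly, every censored logit at most tau."""
        c = np.asarray(candidate_logits, dtype=float)
        if not np.allclose(c[revealed_idx], observed_logits, atol=atol):
            return False                               # revealed logits are fixed
        hidden = np.ones(c.size, dtype=bool)
        hidden[revealed_idx] = False
        return bool(np.all(c[hidden] <= tau + atol))   # per-token upper bound only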

  2. Referee: [KL-recovery section] The KL-recovery bounds are described as 'computable binary-endpoint lower bound' and 'asymptotically matching small-ambiguity upper bound.' The manuscript should explicitly state the precise optimization problem solved for the lower bound and the regime in which the upper bound becomes tight (e.g., as the number of censored tokens or the gap to τ grows).

    Authors: We agree that greater explicitness will improve clarity. The binary-endpoint lower bound is computed by evaluating the KL divergence at the two extremal distributions of the identified set (the -∞ and τ completions) and taking the minimum over pairs. The asymptotically matching upper bound holds in the small-ambiguity regime, specifically as the total censored mass (V - K) exp(τ) / Z_A becomes small, which occurs when either V - K is moderate but the gap to τ is large or when the observed partition function dominates. We will revise the section to state the optimization problem formally and specify the tightness regime. revision: yes
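The small-ambiguity regime the authors invoke can be made explicit with one line of algebra (ours, using only the quantities quoted above): writing the censored-to-observed mass ratio as ε,

    \varepsilon = \frac{(V-K)\,e^{\tau}}{Z_A},
    \qquad
    U_K = \frac{\varepsilon}{1+\varepsilon}
        = \varepsilon - \varepsilon^{2} + O(\varepsilon^{3}),

so the ambiguity U_K shrinks linearly with ε, matching both routes to smallness the response names (a large gap to τ, or a dominant observed partition function).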

Circularity Check

0 steps flagged

No significant circularity; U_K follows directly from identified-set definition

Full rationale

The paper defines the identified set as the collection of all teacher distributions p that are consistent with the observed top-K logits and censoring threshold τ under the deterministic access model. It then computes the total-variation diameter of this set as the closed-form U_K = (V-K)exp(τ)/(Z_A + (V-K)exp(τ)), where Z_A is the observed partition function. This is obtained by evaluating the TV distance between the two extremal completions (censored logits → -∞ versus censored logits = τ), which are both attainable within the model. No parameters are fitted to data, no self-citation is invoked to justify the diameter, and the expression does not reduce to any prior fitted quantity or ansatz. The subsequent KL bounds and experiments on capability extraction are independent of this geometric fact and do not rely on it circularly. The derivation is therefore self-contained mathematical geometry from the stated access model.
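For completeness, the TV computation the rationale describes is two lines once the endpoints are written out (our algebra; A is the revealed set, ℓ_i the revealed logits, M = (V-K)e^τ):

    p_{\mathrm{lo}}(i) = \frac{e^{\ell_i}}{Z_A}, \quad
    p_{\mathrm{hi}}(i) = \frac{e^{\ell_i}}{Z_A + M} \ (i \in A); \qquad
    p_{\mathrm{lo}}(j) = 0, \quad
    p_{\mathrm{hi}}(j) = \frac{e^{\tau}}{Z_A + M} \ (j \notin A),

    \mathrm{TV}(p_{\mathrm{lo}}, p_{\mathrm{hi}})
      = \tfrac{1}{2}\Big[\sum_{i \in A} e^{\ell_i}\Big(\tfrac{1}{Z_A} - \tfrac{1}{Z_A + M}\Big)
        + \sum_{j \notin A} \tfrac{e^{\tau}}{Z_A + M}\Big]
      = \tfrac{1}{2}\Big[\tfrac{M}{Z_A + M} + \tfrac{M}{Z_A + M}\Big]
      = \frac{M}{Z_A + M} = U_K.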

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger constructed from abstract only; full paper may introduce additional assumptions about logit access.

axioms (2)
  • domain assumption: API returns exactly the top-K logits above a fixed threshold τ and censors the remainder
    Defines the identified set and the U_K formula
  • standard math: total variation and KL divergence are suitable metrics for quantifying distribution-recovery limits
    Used to state the diameter and KL bounds

pith-pipeline@v0.9.0 · 5501 in / 1188 out tokens · 64273 ms · 2026-05-12T04:54:24.806627+00:00 · methodology

