pith. machine review for the scientific record.

arxiv: 2605.12201 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.AI

Recognition: no theorem link

Uncertainty Quantification for LLM-based Code Generation

Feng Xu, Guangyuan Wu, Senrong Xu, Taolue Chen, Xiaoxing Ma, Yanke Zhou, Yuan Yao, Yuhao Tan, Zenan Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 03:54 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords uncertainty quantification · prediction sets · code generation · large language models · risk control · multiple hypothesis testing · partial programs · LLM

The pith

LLM-based code generation can produce partial programs as prediction sets guaranteed to contain a correct solution with high confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an approach to adapt prediction sets for uncertainty quantification in structured tasks like LLM code generation. It overcomes limitations of prior PAC-based methods by using multiple hypothesis testing instead of assuming monotonic risk or single valid outputs. This produces a partial program that serves as the prediction set, ensuring a correct code solution is included at a controlled risk level. A sympathetic reader would care because it provides a way to quantify uncertainty in generative models where outputs are complex and multiple correct answers exist. Experiments across three LLMs show practical gains, such as reducing the amount of code that needs removal by up to 24.5% at equivalent risk.

Core claim

Given a trained code generation model, the method leverages multiple hypothesis testing to construct risk-controlling predictions represented by a partial program that is guaranteed to contain a correct solution with high confidence, addressing the non-monotonic risk and multi-valid-output characteristics of code generation.
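
One way to write this guarantee formally, with notation that is illustrative rather than taken from the paper: for a prompt x with (possibly many) correct solutions Y*, let T_λ(x) be the partial program obtained by pruning the model's candidate at level λ, and define the risk as the probability that no correct solution completes the partial program.

```latex
R(\lambda) \;=\; \Pr_{(x,\,Y^{*})}\!\left[\;\nexists\, y \in Y^{*} \text{ such that } y \text{ completes } T_{\lambda}(x)\;\right],
\qquad
\Pr\!\left[\, R(\hat{\lambda}) \le \alpha \,\right] \;\ge\; 1 - \delta ,
```

where λ̂ is the pruning level selected by the multiple-testing procedure on calibration data, α is the target risk, and δ the confidence parameter.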

What carries the argument

Multiple hypothesis testing applied to construct risk-controlling partial programs as prediction sets for LLM code generation.
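
A minimal sketch of how a Learn-then-Test style selection could be instantiated for this setting; the function names, the grid of pruning levels, and the Hoeffding p-value are assumptions for illustration, not the paper's exact RisCoSet procedure.

```python
import math

def hoeffding_pvalue(emp_risk: float, alpha: float, n: int) -> float:
    """Valid p-value for H_lambda: true risk > alpha, from the empirical risk
    on n calibration prompts (Hoeffding's inequality; one standard choice)."""
    if emp_risk >= alpha:
        return 1.0
    return math.exp(-2.0 * n * (alpha - emp_risk) ** 2)

def select_pruning_level(loss_table, lambdas, alpha=0.1, delta=0.1):
    """Fixed-sequence selection of a pruning level with certified risk.

    loss_table[i][j] is 1 if, for calibration prompt i, the partial program
    produced at pruning level lambdas[j] admits NO correct completion among
    the sampled candidates (e.g. checked by test-case execution), else 0.
    lambdas is a pre-specified order, here from most pruning (expected safest)
    to least; fixed-sequence testing controls family-wise error at delta for
    any such order, so no monotonicity of the risk is assumed.
    """
    n = len(loss_table)
    selected = None
    for j, lam in enumerate(lambdas):
        emp_risk = sum(row[j] for row in loss_table) / n
        if hoeffding_pvalue(emp_risk, alpha, n) <= delta:
            selected = lam   # risk certified <= alpha at this level
        else:
            break            # stop at the first non-rejected hypothesis
    return selected          # None: even maximal pruning was not certified

# Toy numbers, purely illustrative: 3 pruning levels, 5 calibration prompts.
lambdas = [0.9, 0.5, 0.1]    # fraction of AST nodes removed
loss_table = [[0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1]]
print(select_pruning_level(loss_table, lambdas, alpha=0.7, delta=0.2))  # 0.5
```

The least-pruning level that is still certified is the one returned, which is what lets such a method trade fewer node removals against the same risk target.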

If this is right

  • The method produces prediction sets without restricting to single outputs or requiring monotonic risk.
  • Risk control is achieved for code generation tasks on three different LLMs.
  • Compared to state-of-the-art, it reduces code removal by up to 24.5% at the same risk level.
  • Prediction sets can be represented compactly as partial programs rather than full candidates.
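
As a concrete illustration of the partial-program representation, the following hypothetical sketch prunes selected AST nodes into named holes using Python's ast module; the hole marker and the node-selection rule are placeholders, not the paper's.

```python
import ast

class HoleInserter(ast.NodeTransformer):
    """Replace selected expression nodes with a named hole, leaving a
    partial program that any consistent completion could fill."""
    def __init__(self, nodes_to_remove):
        super().__init__()
        self.nodes_to_remove = nodes_to_remove  # ids of AST nodes to prune

    def visit(self, node):
        if id(node) in self.nodes_to_remove and isinstance(node, ast.expr):
            # A real system would track hole positions and types; a bare
            # placeholder name is enough to show the shape of the output.
            return ast.copy_location(ast.Name(id="__HOLE__", ctx=ast.Load()), node)
        return self.generic_visit(node)

code = "def area(h, w):\n    return (h - 1) * (w - 1)\n"
tree = ast.parse(code)
# Hypothetically flag the two subtraction sub-expressions as uncertain.
targets = {id(n) for n in ast.walk(tree)
           if isinstance(n, ast.BinOp) and isinstance(n.op, ast.Sub)}
partial = HoleInserter(targets).visit(tree)
ast.fix_missing_locations(partial)
print(ast.unparse(partial))  # body becomes: return __HOLE__ * __HOLE__
```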

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could use the partial programs to focus completion efforts only on the uncertain parts of the code.
  • The approach might extend to other structured generation tasks like text or molecule generation if similar risk structures apply.
  • Future work could test the method on larger codebases or different risk functions to verify the guarantees hold in practice.

Load-bearing premise

Multiple hypothesis testing can be directly adapted to the non-monotonic risk structure and multi-valid-output nature of code generation without needing extra conditions on the model's output distribution or the risk function.
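
A one-line reason the premise is at least plausible, via a generic union bound that is not taken from the paper: if each hypothesis H_λ : R(λ) > α is tested with a valid p-value p_λ at budget δ_λ, with the budgets summing to δ, then

```latex
\Pr\!\left[\,\exists\, \lambda \text{ rejected with } R(\lambda) > \alpha \,\right]
\;\le\; \sum_{\lambda \in \Lambda} \Pr\!\left[\, p_{\lambda} \le \delta_{\lambda},\; R(\lambda) > \alpha \,\right]
\;\le\; \sum_{\lambda \in \Lambda} \delta_{\lambda} \;=\; \delta ,
```

with no assumption on how the risks R(λ) relate across levels; what still needs verification in the paper is that the p-values are valid for the code-generation risk as actually estimated.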

What would settle it

An evaluation on a held-out set of code-generation prompts measuring how often the produced partial programs exclude all correct solutions: a rate at or below the target risk threshold would support the guarantee, while a rate that exceeds it would falsify the claimed risk control.
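
A minimal sketch of such an evaluation, assuming a held-out set of prompts with reference solutions and a hypothetical completes(sol, pp) oracle (e.g., matching on the unpruned AST plus test-case execution); none of these names come from the paper.

```python
def empirical_miscoverage(partial_programs, correct_solutions, completes, alpha):
    """Fraction of held-out prompts whose partial program excludes every
    known-correct solution. The claimed guarantee predicts this rate stays
    at or below alpha, up to the confidence parameter and sampling noise.

    partial_programs[i]  : partial program emitted for prompt i
    correct_solutions[i] : list of reference solutions for prompt i
    completes(sol, pp)   : True if sol is a valid completion of pp
    """
    misses = sum(
        1 for pp, sols in zip(partial_programs, correct_solutions)
        if not any(completes(s, pp) for s in sols)
    )
    rate = misses / len(partial_programs)
    return rate, rate <= alpha
```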

Figures

Figures reproduced from arXiv: 2605.12201 by Feng Xu, Guangyuan Wu, Senrong Xu, Taolue Chen, Xiaoxing Ma, Yanke Zhou, Yuan Yao, Yuhao Tan, Zenan Li.

Figure 1
Figure 1: An illustrative example from MBPP. The left part is a correct code snippet, and the right part is a generated but incorrect one. RisCoSet removes three nodes in the AST, resulting in a prediction set (i.e., a partial program) that contains the correct program.
Figure 2
Figure 2: Percentage of node removals (top row) and satisfying code sets (bottom row) w.r.t. risk level α. The results are the mean over 100 random splits. Smaller node removal is better, provided the code-set coverage exceeds the target bound 1 − α. Our approach constructs prediction sets that remove significantly fewer nodes compared to baselines for three LLMs on all datasets.
Figure 3
Figure 3: Parameter sensitivity analysis of sampling quantity m w.r.t. risk level α on MBPP. The average results over 100 trials show that a larger value of m leads to fewer node removals, while maintaining the required risk control.
Figure 4
Figure 5
Figure 5: An illustrative example from HumanEval. The top part is a correct code snippet, and the bottom part is a generated but incorrect one. RisCoSet removes one node in the AST, resulting in a prediction set (i.e., a partial program) that contains the correct program.
Figure 6
Figure 6: An illustrative example from APPS. The left part is a correct code snippet, and the right part is a generated but incorrect one. RisCoSet removes seven nodes in the AST, resulting in a prediction set (i.e., a partial program) that contains the correct program.
Figure 7
Figure 7: An illustrative example from APPS. The top part is a correct code snippet, and the bottom part is a generated but incorrect one. RisCoSet removes three nodes in the AST (we omit the overlapping AST structures for brevity), resulting in a prediction set (i.e., a partial program) that contains the correct program.
read the original abstract

Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt proposes PAC prediction sets but is limited by its strong monotonicity assumption on risk and single-label classification framework, which severely limits the space of candidate programs and cannot accommodate the multiple valid outputs inherent to code generation. To address these limitations, we propose an approach RisCoSet that leverages multiple hypothesis testing to construct risk-controlling predictions for LLM-based code generation. Given a trained code generation model, we produce a prediction set represented by a partial program, which is guaranteed to contain a correct solution with high confidence. Extensive experiments on three LLMs demonstrate the effectiveness of the proposed method. For instance, compared with the state-of-the-art, our method can significantly reduce the code removal by up to 24.5%, at the same level of risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RisCoSet, which adapts multiple hypothesis testing to construct risk-controlling prediction sets for LLM-based code generation. The sets are represented as partial programs guaranteed to contain at least one correct solution with probability at least 1-α. This relaxes the strong monotonicity assumption and single-valid-output restriction of prior PAC prediction sets. Experiments on three LLMs report up to 24.5% reduction in code removal compared to the state-of-the-art at equivalent risk levels.

Significance. If the coverage guarantee is valid, the work would meaningfully extend conformal-style uncertainty quantification to structured, multi-output generation tasks where monotonicity fails. The empirical reduction in removed code suggests practical value for code-completion tools. However, the absence of a derivation or explicit verification that the risk function satisfies the conditions for valid p-value construction and family-wise error control under dependence and non-monotonicity limits the assessed significance.

major comments (2)
  1. [Abstract] The guarantee that the partial-program prediction set 'is guaranteed to contain a correct solution with high confidence' is asserted via multiple hypothesis testing, yet no derivation, proof sketch, or definition of the risk function (probability that a partial program has no correct completion) is supplied. This is load-bearing for the central claim, as standard multiple-testing theorems require conditions on the risk function and hypothesis dependence that the skeptic note indicates are likely violated by non-monotonic code-generation risk.
  2. [Experimental Evaluation] The reported gains (e.g., 24.5% reduction in code removal) are summarized without error bars, full methodology for applying the multiple-testing procedure to code outputs, or details on how p-values are computed from the LLM's output distribution. This prevents assessment of whether the empirical results actually support the claimed risk control.
minor comments (2)
  1. The abstract refers to 'three LLMs' and 'state-of-the-art' without naming the models, datasets, or baseline methods; adding these in the experiments section would improve reproducibility.
  2. Notation for the partial-program prediction set and the risk function could be introduced with a small concrete example early in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments highlight important areas for clarification on the theoretical foundations and experimental reporting. We address each point below and will revise the manuscript accordingly to strengthen the presentation of the risk-control guarantees and empirical methodology.

read point-by-point responses
  1. Referee: [Abstract] The guarantee that the partial-program prediction set 'is guaranteed to contain a correct solution with high confidence' is asserted via multiple hypothesis testing, yet no derivation, proof sketch, or definition of the risk function (probability that a partial program has no correct completion) is supplied. This is load-bearing for the central claim, as standard multiple-testing theorems require conditions on the risk function and hypothesis dependence that the skeptic note indicates are likely violated by non-monotonic code-generation risk.

    Authors: We agree that an explicit definition of the risk function and a derivation of the coverage guarantee are necessary for the central claim. The risk function is defined as the probability that a given partial program admits no correct completion under the data distribution. RisCoSet constructs hypotheses over candidate completions of the partial program and applies a multiple-testing procedure (controlling family-wise error) to ensure that, with probability at least 1-α, the retained partial program has at least one valid completion. While the manuscript states the high-level adaptation, we acknowledge the absence of a self-contained proof sketch addressing dependence and non-monotonicity. In revision we will add a dedicated subsection with the formal definition, the precise hypothesis construction, and a proof outline showing why the standard multiple-testing conditions hold under our partial-program representation (which relaxes monotonicity by design). revision: yes

  2. Referee: [Experimental Evaluation] The reported gains (e.g., 24.5% reduction in code removal) are summarized without error bars, full methodology for applying the multiple-testing procedure to code outputs, or details on how p-values are computed from the LLM's output distribution. This prevents assessment of whether the empirical results actually support the claimed risk control.

    Authors: We accept that the experimental section requires additional detail to allow readers to verify the risk-control claims. The 24.5% figure is the maximum observed reduction across the three LLMs and datasets at matched risk levels; however, we did not report variability across random seeds or full implementation steps. In the revision we will: (i) add error bars computed over 5 independent calibration/test splits, (ii) provide a step-by-step description of how the multiple-testing procedure is instantiated on token sequences (including the exact mapping from LLM logits to per-hypothesis p-values), and (iii) include pseudocode for the p-value computation and the partial-program pruning step. These additions will make the empirical support for risk control transparent. revision: yes
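
Since the rebuttal promises pseudocode for the family-wise error step, a generic Holm step-down procedure is sketched below as a reference point; it is a standard method from the multiple-testing literature (Holm, 1979) that the paper cites, but whether RisCoSet uses it, a Bonferroni split, or fixed-sequence testing is not stated in the material above.

```python
def holm_reject(pvalues, delta=0.1):
    """Holm (1979) step-down procedure: boolean rejection mask controlling
    the family-wise error rate at delta under arbitrary dependence."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= delta / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis survives, all larger p-values do too
    return reject

# Toy p-values over four candidate pruning levels.
print(holm_reject([0.001, 0.04, 0.20, 0.008], delta=0.05))  # [True, False, False, True]
```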

Circularity Check

0 steps flagged

No circularity; derivation adapts established multiple hypothesis testing without self-referential reduction

full rationale

The paper's central construction of RisCoSet uses multiple hypothesis testing to produce risk-controlling partial-program prediction sets. This follows directly from standard theorems on family-wise error control and p-value construction under the stated risk function, without any equation or step reducing the coverage guarantee to a fitted parameter, self-definition, or prior self-citation that itself depends on the target result. No load-bearing ansatz, uniqueness theorem, or renaming of known results is introduced via self-reference. The method can be checked against external benchmarks in the conformal prediction and hypothesis testing literature rather than relying on self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on statistical assumptions about risk control via multiple testing applied to LLM outputs; no free parameters are explicitly quantified in the abstract, and the only invented entity is the RisCoSet framework itself.

axioms (1)
  • domain assumption Multiple hypothesis testing procedures can control the risk of missing a correct program in structured generation tasks
    Invoked to justify the guarantee for partial-program prediction sets
invented entities (1)
  • RisCoSet no independent evidence
    purpose: Risk-controlling prediction set construction for LLM code generation
    New method name and framework introduced to overcome prior limitations

pith-pipeline@v0.9.0 · 5482 in / 1193 out tokens · 47816 ms · 2026-05-13T03:54:52.632730+00:00 · methodology

discussion (0)

