arxiv: 2509.17314 · v3 · submitted 2025-09-22 · 💻 cs.SE · cs.LG

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

Juyeon Yoon, Somin Kim, Robert Feldt, Shin Yoo

Pith reviewed 2026-05-18 15:25 UTC · model grok-4.3


keywords: LLM testing, test adequacy, hidden states, Gaussian Mixture Model, pre-generation, failure prediction, software testing, input difficulty


The pith

Clotho estimates how likely an input is to cause an LLM to fail on a task by analyzing hidden states before any output is generated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a way to judge the usefulness of test inputs for LLMs on specific tasks without running the full and costly generation process. It extracts information from the model's internal hidden states to model which inputs are difficult for the task at hand. A Gaussian Mixture Model is then fitted on a small set of human-labeled examples to learn patterns associated with likely failures. This model ranks new inputs by their predicted failure risk, allowing test selection to happen early. If the approach holds, it would let developers focus testing effort on the inputs that matter most while avoiding unnecessary LLM runs.

Core claim

Clotho estimates input difficulty directly from LLM hidden states using a Gaussian Mixture Model fitted on an adaptively sampled small reference set of human-labeled inputs, allowing it to rank unseen inputs by their likelihood of causing task failures without any output generation.

What carries the argument

A Gaussian Mixture Model applied to the hidden state representations of inputs for a specific task, which learns to separate cases by difficulty from a small labeled reference set and then scores new inputs accordingly.

If this is right

The approach can be applied across different tasks and open-weight models to select informative test inputs early.
Adequacy scores derived from open models transfer to help prioritize tests on proprietary models, finding more failures than random ordering.
Pre-generation adequacy based on hidden states works alongside existing post-generation measures that use output uncertainty.
Test selection for LLMs becomes feasible with far fewer full inferences since only a small reference set needs labeling and generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If hidden states reliably signal difficulty, the same idea might apply to selecting training data or prompts for fine-tuning rather than just testing.
Teams building LLM-based tools could integrate this ranking to automatically surface edge-case inputs during development cycles.
The method invites checking whether other internal signals, such as attention patterns, could further improve the difficulty prediction for the same tasks.

Load-bearing premise

That the hidden states produced by the LLM when processing an input contain information about how difficult that input will be for the model to handle correctly on the given task, allowing a statistical model to generalize from a few labeled examples.

What would settle it

Running full generation on a batch of inputs that Clotho ranks as high failure risk and observing that they do not produce more actual failures than a random batch of the same size would show the ranking adds no value.
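The check described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation harness: `failures_in_top_k`, `mean_failures_random`, and the toy `scores` and `failed` lists are hypothetical names and made-up data, and the scores are assumed to be oriented so that a higher value means a higher predicted failure risk.

```python
import random

def failures_in_top_k(scores, failed, k):
    """Count actual failures among the k inputs with the highest risk score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(failed[i] for i in ranked[:k])

def mean_failures_random(failed, k, trials=2000, seed=0):
    """Average failure count over random batches of size k (the baseline)."""
    rng = random.Random(seed)
    indices = list(range(len(failed)))
    total = 0
    for _ in range(trials):
        total += sum(failed[i] for i in rng.sample(indices, k))
    return total / trials

# Toy data (made up for illustration): 1 means the input failed
# once full generation was actually run and judged.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
failed = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

top = failures_in_top_k(scores, failed, k=4)     # failures in the ranked batch
baseline = mean_failures_random(failed, k=4)     # failures in a random batch
print(f"ranked batch: {top}, random batch (mean): {baseline:.2f}")
```

If the ranked batch does not contain clearly more failures than the random baseline across repeated runs, the ranking adds no value in the sense described above.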

Figures

Figures reproduced from arXiv: 2509.17314 by Juyeon Yoon, Robert Feldt, Shin Yoo, Somin Kim.

**Figure 3.** Input hidden states of Llama 8B from two prompted tasks (one per line) for varying depths of layers.

**Figure 4.** Projection of prediction scores from CLOTHO, MDSA, and GT pass rates (LLaMA, N=500). We observe that CLOTHO outperforms MDSA on some tasks (e.g., JSON-FIX), while the opposite happens for other tasks (e.g., SPELL-CHK). We posit that this can be explained by the structures in input distribution …

**Figure 5.** Spearman Rank Correlations from Different Sampling Strategies (…

**Figure 6.** Distribution of selected reference points with each sampling strategy (Llama, …

**Figure 7.** 3-way Venn diagrams drawn from samples with high adequacy scores (top 25%) each from …

**Figure 8.** Average percentage of detected failures when scores from Open-weight Language Models (OLMs) are applied …

Original abstract

Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and costly: many prompts lack ground truths, forcing reliance on human judgments, while existing test adequacy measures typically rely on output uncertainty and thus are only available after full inference. A key challenge is to assess how useful a test input is in a way that reflects the demands of the task, ideally before even generating any output. We introduce Clotho, a task-specific, pre-generation test adequacy measure that estimates input difficulty directly from LLM hidden states. Given a large pool of unlabelled inputs for a specific task, Clotho uses a Gaussian Mixture Model (GMM) to adaptively sample the most informative cases for human labelling. Based on this reference set the GMM can then rank unseen inputs by their likelihood of failure. In our empirical evaluation across eight benchmark tasks and three open-weight LLMs, Clotho can predict failures with a ROC-AUC of 0.716, after labelling reference sets that are on average only 5.4% of inputs. It does so without generating any outputs, thereby significantly reducing LLM execution costs compared to output-based uncertainty or confidence measures. Comparison of Clotho and these post-generation adequacy measures shows that the two approaches complement each other. Crucially, we show that adequacy scores learnt from open-weight LLMs transfer effectively to proprietary models, extending the applicability of the approach. When prioritising test inputs for proprietary models, Clotho increases the average number of failing inputs from 18.7 to 42.5 out of 100, compared to random prioritisation.
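To make the headline metric concrete: ROC-AUC can be read as the probability that a randomly chosen failing input receives a higher score than a randomly chosen passing one, so 0.5 corresponds to random ordering and 1.0 to a perfect ranking. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy data are illustrative assumptions, and the scores are assumed to increase with predicted failure risk.

```python
def roc_auc(scores, failed):
    """Probability that a failing input outscores a passing one (ties count half)."""
    pos = [s for s, f in zip(scores, failed) if f]        # scores of failing inputs
    neg = [s for s, f in zip(scores, failed) if not f]    # scores of passing inputs
    if not pos or not neg:
        raise ValueError("need at least one failing and one passing input")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (made up): three failing and three passing inputs.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
failed = [1, 0, 1, 0, 0, 1]
auc = roc_auc(scores, failed)
print(round(auc, 3))
```

On this toy data 5 of the 9 failing/passing pairs are ordered correctly, so the result is 5/9. The pairwise loop is quadratic in the number of inputs, which is fine for a sketch; a rank-based formulation would be preferable for large pools.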

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Clotho gives a workable pre-generation filter for LLM test inputs via hidden states and GMM sampling, but the link to actual failure prediction looks tentative for generation tasks.


Clotho tries to score how likely an input is to cause an LLM failure using only the hidden states from prompt encoding, then fits a GMM on a small human-labeled reference set to rank the rest. The main draw is skipping output generation entirely while still getting better-than-random selection of failing cases, with reported ROC-AUC of 0.716 after labeling roughly 5% of inputs on average. They also show the scores transfer from open-weight models to proprietary ones and complement post-generation uncertainty measures.

That transfer result and the cost angle are the clearest practical contributions here. The evaluation across eight tasks and three models gives the claims some breadth that a narrower study would lack. The lift in failing cases found per 100 inputs is a concrete number worth noting for anyone running repeated LLM tests.

The soft spot is the core assumption that hidden states encode task-specific difficulty signals strong enough for the GMM to extrapolate reliably. For generation-heavy work, many failures trace to sampling choices or output constraints that are invisible in a static prompt embedding. If the clusters mainly reflect input length or topic instead, the gains could be benchmark-specific rather than general. The abstract gives limited detail on controls or statistical checks, so the data-to-claim step still needs scrutiny.

This is aimed at software engineering researchers building testing pipelines for LLM-based systems. Someone looking for cheaper ways to prioritize inputs before full inference would get usable ideas from it. The work shows honest engagement with the cost problem and has enough empirical grounding to go to a serious referee, even though it will likely need more experiments on why the hidden states carry the right signal. I would send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces Clotho, a pre-generation test adequacy measure for LLM inputs on specific tasks. It uses a Gaussian Mixture Model fitted on hidden states extracted while encoding a small adaptively sampled reference set (average 5.4% of inputs, human-labeled for failure) to rank unseen inputs by predicted failure likelihood. Across eight benchmark tasks and three open-weight LLMs, it reports a ROC-AUC of 0.716 without any output generation. The scores complement post-generation uncertainty measures and transfer to proprietary models, where prioritizing test inputs with Clotho raises the average number of failing inputs out of 100 from 18.7 (random prioritization) to 42.5.

Significance. If the empirical link between prompt hidden states and task-specific failure likelihood holds, Clotho could meaningfully lower the cost of testing LLM-based software by enabling pre-inference prioritization. The multi-task/multi-model evaluation, the explicit complementarity result with output-based baselines, and the transfer demonstration to closed models are concrete strengths that would support adoption in practice.

major comments (3)

[Evaluation / Results] Evaluation section (results on ROC-AUC and lift): the reported average ROC-AUC of 0.716 and the lift from 18.7 to 42.5 failing cases per 100 lack reported confidence intervals, per-task variance, or statistical significance tests against the random baseline; without these, it is difficult to judge whether the performance is robust or sensitive to the particular reference-set sampling procedure described in the method.
[Method] Method section (GMM on hidden states): the central modeling assumption is that hidden states from static prompt encoding contain separable task-specific difficulty signals; however, the paper does not report an ablation that controls for surface features such as input length or token count, which could confound the GMM clustering and undermine the claim that the 0.716 AUC reflects genuine pre-generation adequacy rather than proxy correlations.
[Evaluation / Transfer results] Transfer experiment: the claim that adequacy scores learned on open-weight models transfer effectively to proprietary models is load-bearing for the practical contribution, yet the manuscript provides no quantitative details on reference-set size, labeling protocol, or domain-shift mitigation when the GMM is applied to a different model family.

minor comments (2)

[Abstract / Method] The abstract and method description should explicitly state the number of components chosen for the GMM and the criterion used for model selection.
[Figures] Figure captions for the embedding visualizations should indicate the axes and whether the plotted points are colored by predicted or ground-truth failure labels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has identified important opportunities to strengthen the empirical rigor and clarity of our work on Clotho. We address each major comment below and commit to specific revisions that directly respond to the concerns raised.


Referee: [Evaluation / Results] Evaluation section (results on ROC-AUC and lift): the reported average ROC-AUC of 0.716 and the lift from 18.7 to 42.5 failing cases per 100 lack reported confidence intervals, per-task variance, or statistical significance tests against the random baseline; without these, it is difficult to judge whether the performance is robust or sensitive to the particular reference-set sampling procedure described in the method.

Authors: We agree that additional statistical reporting is necessary to demonstrate robustness. In the revised manuscript we will add 95% bootstrap confidence intervals for both the average ROC-AUC and the lift metric. We will also report per-task ROC-AUC values together with their standard deviations across repeated reference-set samplings. Finally, we will include statistical significance tests (paired Wilcoxon signed-rank tests) of Clotho against the random baseline, both per task and aggregated, and we will present sensitivity results for different reference-set sampling ratios and random seeds in an appendix.

revision: yes
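
The 95% bootstrap confidence interval promised in this response could be computed along the following lines. This is a sketch of the generic percentile bootstrap, not the authors' procedure: `bootstrap_ci` is a hypothetical helper, and the per-task values are invented placeholders rather than results from the paper.

```python
import random

def bootstrap_ci(values, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def mean(xs):
    return sum(xs) / len(xs)

# Placeholder per-task AUC values, made up for illustration only.
per_task_auc = [0.60, 0.65, 0.70, 0.75, 0.80, 0.70, 0.65, 0.75]
lo, hi = bootstrap_ci(per_task_auc, mean)
print(f"mean = {mean(per_task_auc):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling over tasks, as here, captures variation between tasks only. Capturing sensitivity to the reference-set sampling would additionally require repeating the sampling itself with different seeds, as the response notes.
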
Referee: [Method] Method section (GMM on hidden states): the central modeling assumption is that hidden states from static prompt encoding contain separable task-specific difficulty signals; however, the paper does not report an ablation that controls for surface features such as input length or token count, which could confound the GMM clustering and undermine the claim that the 0.716 AUC reflects genuine pre-generation adequacy rather than proxy correlations.

Authors: We acknowledge that surface features could partially explain the observed clustering. To isolate the contribution of the hidden-state representations, we will add an ablation study in the revised version. The ablation will train a baseline GMM (or linear model) using only input length and token count as features and compare its ROC-AUC and lift performance directly against the full hidden-state GMM. We will report the results and discuss whether the hidden-state model retains a meaningful advantage after controlling for these surface features.

revision: yes
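
A quicker, complementary probe of the same confound (not the ablation the response describes) is to measure how strongly the adequacy scores track token count. The sketch below computes a Spearman rank correlation with average ranks for ties; `ranks`, `spearman`, and the toy data are illustrative assumptions.

```python
def ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy data (made up): scores that rise monotonically with token count.
token_counts = [12, 40, 25, 90, 60]
scores = [0.1, 0.5, 0.3, 0.9, 0.7]
rho = spearman(token_counts, scores)
print(rho)
```

A correlation near 1 or -1, as in this deliberately extreme toy case, would suggest the score is largely a proxy for length. A value near 0 would not by itself rule out other surface confounds such as topic.
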
Referee: [Evaluation / Transfer results] Transfer experiment: the claim that adequacy scores learned on open-weight models transfer effectively to proprietary models is load-bearing for the practical contribution, yet the manuscript provides no quantitative details on reference-set size, labeling protocol, or domain-shift mitigation when the GMM is applied to a different model family.

Authors: We agree that the transfer results require more explicit documentation. In the revised manuscript we will expand the transfer section to include: (i) the exact reference-set sizes used when fitting the GMM on open-weight models for subsequent application to proprietary models (following the same adaptive sampling procedure that yields the 5.4% average), (ii) the labeling protocol (human annotation of failure on the reference set), and (iii) the domain-shift mitigation steps employed, such as consistent task prompt templates and per-layer hidden-state normalization. We will also report the concrete ROC-AUC and lift numbers obtained in the transfer setting.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical ML pipeline is self-contained


The paper describes a standard supervised modeling pipeline: human labels are collected on a small adaptive sample (5.4% of inputs) drawn via GMM on hidden states, the GMM is then fitted to those labels, and its ranking performance is measured by ROC-AUC on held-out inputs. This is ordinary cross-validation-style evaluation of a fitted predictor; the reported AUC is not equivalent to any input by construction, nor does any equation or self-citation reduce the central claim to a tautology. No self-definitional steps, fitted-input-as-prediction, or load-bearing self-citations appear in the provided abstract or method outline. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The approach implicitly relies on the standard assumption that hidden states are informative for difficulty, but this is not formalized here.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Paper passage: CLOTHO constructs a Gaussian Mixture Model over the Last-token Input Hidden States (LIHS) of passing inputs... LSA(x) = -log p_θ(h(x))
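
The score quoted in this passage, LSA(x) = -log p_θ(h(x)), is the negative log-likelihood of a hidden-state vector under the fitted mixture. A minimal sketch of that computation follows, assuming the mixture parameters are already fitted and, purely for simplicity, that each component has a diagonal covariance; the quoted passage does not specify the covariance structure. The names `log_gaussian_diag` and `lsa` and the toy parameters are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def lsa(x, weights, means, variances):
    """-log p(x) under a Gaussian mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    log_p = top + math.log(sum(math.exp(t - top) for t in terms))
    return -log_p

# Toy two-component mixture in 2-D (parameters made up, standing in for
# a model fitted on hidden states of passing inputs).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = lsa([0.0, 0.0], weights, means, variances)    # close to a component mean
far = lsa([10.0, 10.0], weights, means, variances)   # far from both components
print(near < far)
```

Because the mixture is described as modelling passing inputs, a larger value can be read as "less like the inputs known to pass", which is what makes the score usable for ranking by failure risk.
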
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

Paper passage: We introduce CLOTHO, a task-specific, pre-generation test adequacy measure that estimates input difficulty directly from LLM hidden states.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
