pith. machine review for the scientific record.
sign in

arxiv: 2509.17314 · v3 · submitted 2025-09-22 · 💻 cs.SE · cs.LG

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

Pith reviewed 2026-05-18 15:25 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords LLM testingtest adequacyhidden statesGaussian Mixture Modelpre-generationfailure predictionsoftware testinginput difficulty
0
0 comments X

The pith

Clotho estimates how likely an input is to cause an LLM to fail on a task by analyzing hidden states before any output is generated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a way to judge the usefulness of test inputs for LLMs on specific tasks without running the full and costly generation process. It extracts information from the model's internal hidden states to model which inputs are difficult for the task at hand. A Gaussian Mixture Model is then fitted on a small set of human-labeled examples to learn patterns associated with likely failures. This model ranks new inputs by their predicted failure risk, allowing test selection to happen early. If the approach holds, it would let developers focus testing effort on the inputs that matter most while avoiding unnecessary LLM runs.

Core claim

Clotho estimates input difficulty directly from LLM hidden states using a Gaussian Mixture Model fitted on an adaptively sampled small reference set of human-labeled inputs, allowing it to rank unseen inputs by their likelihood of causing task failures without any output generation.

What carries the argument

A Gaussian Mixture Model applied to the hidden state representations of inputs for a specific task, which learns to separate cases by difficulty from a small labeled reference set and then scores new inputs accordingly.

If this is right

  • The approach can be applied across different tasks and open-weight models to select informative test inputs early.
  • Adequacy scores derived from open models transfer to help prioritize tests on proprietary models, finding more failures than random ordering.
  • Pre-generation adequacy based on hidden states works alongside existing post-generation measures that use output uncertainty.
  • Test selection for LLMs becomes feasible with far fewer full inferences since only a small reference set needs labeling and generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If hidden states reliably signal difficulty, the same idea might apply to selecting training data or prompts for fine-tuning rather than just testing.
  • Teams building LLM-based tools could integrate this ranking to automatically surface edge-case inputs during development cycles.
  • The method invites checking whether other internal signals, such as attention patterns, could further improve the difficulty prediction for the same tasks.

Load-bearing premise

That the hidden states produced by the LLM when processing an input contain information about how difficult that input will be for the model to handle correctly on the given task, allowing a statistical model to generalize from a few labeled examples.

What would settle it

Running full generation on a batch of inputs that Clotho ranks as high failure risk and observing that they do not produce more actual failures than a random batch of the same size would show the ranking adds no value.

Figures

Figures reproduced from arXiv: 2509.17314 by Juyeon Yoon, Robert Feldt, Shin Yoo, Somin Kim.

Figure 1
Figure 1. Figure 1: Visualisations on Llama 8B internal state space [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Input hidden states of Llama 8B from two prompted tasks (one per line) for varying depths of layers. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Projection of prediction scores from CLOTHO, MDSA, and GT pass rates (LLaMA, N=500). We observe that CLOTHO outperforms MDSA on some tasks (e.g., JSON-FIX), while the opposite happens for other tasks (e.g., SPELL-CHK). We posit that this can be explained by the structures in input distribution [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Spearman Rank Correlations from Different Sampling Strategies ( [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of selected reference points with each sampling strategy (Llama, [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: 3-way Venn diagrams drawn from samples with high adequacy scores (top 25%) each from [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Average percentage of detected failures when scores from Open-weight Language Models (OLMs) are applied [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
read the original abstract

Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and costly: many prompts lack ground truths, forcing reliance on human judgments, while existing test adequacy measures typically rely on output uncertainty and thus are only available after full inference. A key challenge is to assess how useful a test input is in a way that reflects the demands of the task, ideally before even generating any output. We introduce Clotho, a task-specific, pre-generation test adequacy measure that estimates input difficulty directly from LLM hidden states. Given a large pool of unlabelled inputs for a specific task, Clotho uses a Gaussian Mixture Model (GMM) to adaptively sample the most informative cases for human labelling. Based on this reference set the GMM can then rank unseen inputs by their likelihood of failure. In our empirical evaluation across eight benchmark tasks and three open-weight LLMs, Clotho can predict failures with a ROC-AUC of 0.716, after labelling reference sets that are on average only 5.4% of inputs. It does so without generating any outputs, thereby significantly reducing LLM execution costs compared to output-based uncertainty or confidence measures. Comparison of Clotho and these post-generation adequacy measures shows that the two approaches complement each other. Crucially, we show that adequacy scores learnt from open-weight LLMs transfer effectively to proprietary models, extending the applicability of the approach. When prioritising test inputs for proprietary models, Clotho increases the average number of failing inputs from 18.7 to 42.5 out of 100, compared to random prioritisation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Clotho, a pre-generation test adequacy measure for LLM inputs on specific tasks. It uses a Gaussian Mixture Model fitted on hidden states extracted while encoding a small adaptively sampled reference set (average 5.4% of inputs, human-labeled for failure) to rank unseen inputs by predicted failure likelihood. Across eight benchmark tasks and three open-weight LLMs, it reports a ROC-AUC of 0.716 without any output generation; the scores complement post-generation uncertainty measures, transfer to proprietary models, and raise the average number of failing inputs found in the top 100 from 18.7 (random) to 42.5.

Significance. If the empirical link between prompt hidden states and task-specific failure likelihood holds, Clotho could meaningfully lower the cost of testing LLM-based software by enabling pre-inference prioritization. The multi-task/multi-model evaluation, the explicit complementarity result with output-based baselines, and the transfer demonstration to closed models are concrete strengths that would support adoption in practice.

major comments (3)
  1. [Evaluation / Results] Evaluation section (results on ROC-AUC and lift): the reported average ROC-AUC of 0.716 and the lift from 18.7 to 42.5 failing cases per 100 lack reported confidence intervals, per-task variance, or statistical significance tests against the random baseline; without these, it is difficult to judge whether the performance is robust or sensitive to the particular reference-set sampling procedure described in the method.
  2. [Method] Method section (GMM on hidden states): the central modeling assumption is that hidden states from static prompt encoding contain separable task-specific difficulty signals; however, the paper does not report an ablation that controls for surface features such as input length or token count, which could confound the GMM clustering and undermine the claim that the 0.716 AUC reflects genuine pre-generation adequacy rather than proxy correlations.
  3. [Evaluation / Transfer results] Transfer experiment: the claim that adequacy scores learned on open-weight models transfer effectively to proprietary models is load-bearing for the practical contribution, yet the manuscript provides no quantitative details on reference-set size, labeling protocol, or domain-shift mitigation when the GMM is applied to a different model family.
minor comments (2)
  1. [Abstract / Method] The abstract and method description should explicitly state the number of components chosen for the GMM and the criterion used for model selection.
  2. [Figures] Figure captions for the embedding visualizations should indicate the axes and whether the plotted points are colored by predicted or ground-truth failure labels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has identified important opportunities to strengthen the empirical rigor and clarity of our work on Clotho. We address each major comment below and commit to specific revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Evaluation / Results] Evaluation section (results on ROC-AUC and lift): the reported average ROC-AUC of 0.716 and the lift from 18.7 to 42.5 failing cases per 100 lack reported confidence intervals, per-task variance, or statistical significance tests against the random baseline; without these, it is difficult to judge whether the performance is robust or sensitive to the particular reference-set sampling procedure described in the method.

    Authors: We agree that additional statistical reporting is necessary to demonstrate robustness. In the revised manuscript we will add 95% bootstrap confidence intervals for both the average ROC-AUC and the lift metric. We will also report per-task ROC-AUC values together with their standard deviations across repeated reference-set samplings. Finally, we will include statistical significance tests (paired Wilcoxon signed-rank tests) of Clotho against the random baseline, both per task and aggregated, and we will present sensitivity results for different reference-set sampling ratios and random seeds in an appendix. revision: yes

  2. Referee: [Method] Method section (GMM on hidden states): the central modeling assumption is that hidden states from static prompt encoding contain separable task-specific difficulty signals; however, the paper does not report an ablation that controls for surface features such as input length or token count, which could confound the GMM clustering and undermine the claim that the 0.716 AUC reflects genuine pre-generation adequacy rather than proxy correlations.

    Authors: We acknowledge that surface features could partially explain the observed clustering. To isolate the contribution of the hidden-state representations, we will add an ablation study in the revised version. The ablation will train a baseline GMM (or linear model) using only input length and token count as features and compare its ROC-AUC and lift performance directly against the full hidden-state GMM. We will report the results and discuss whether the hidden-state model retains a meaningful advantage after controlling for these surface features. revision: yes

  3. Referee: [Evaluation / Transfer results] Transfer experiment: the claim that adequacy scores learned on open-weight models transfer effectively to proprietary models is load-bearing for the practical contribution, yet the manuscript provides no quantitative details on reference-set size, labeling protocol, or domain-shift mitigation when the GMM is applied to a different model family.

    Authors: We agree that the transfer results require more explicit documentation. In the revised manuscript we will expand the transfer section to include: (i) the exact reference-set sizes used when fitting the GMM on open-weight models for subsequent application to proprietary models (following the same adaptive sampling procedure that yields the 5.4% average), (ii) the labeling protocol (human annotation of failure on the reference set), and (iii) the domain-shift mitigation steps employed, such as consistent task prompt templates and per-layer hidden-state normalization. We will also report the concrete ROC-AUC and lift numbers obtained in the transfer setting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical ML pipeline is self-contained

full rationale

The paper describes a standard supervised modeling pipeline: human labels are collected on a small adaptive sample (5.4% of inputs) drawn via GMM on hidden states, the GMM is then fitted to those labels, and its ranking performance is measured by ROC-AUC on held-out inputs. This is ordinary cross-validation-style evaluation of a fitted predictor; the reported AUC is not equivalent to any input by construction, nor does any equation or self-citation reduce the central claim to a tautology. No self-definitional steps, fitted-input-as-prediction, or load-bearing self-citations appear in the provided abstract or method outline. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The approach implicitly relies on the standard assumption that hidden states are informative for difficulty, but this is not formalized here.

pith-pipeline@v0.9.0 · 5841 in / 1165 out tokens · 40899 ms · 2026-05-18T15:25:02.989807+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 10 internal anchors

  1. [1]

    Confident AI

    Principal component analysis.Wiley interdisciplinary reviews: computational statistics2, 4 (2010), 433–459. Confident AI. 2024a. DeepEval.https://github.com/confident-ai/deepeval. DAIR AI

  2. [2]

    The internal state of an LLM knows when it ' s lying

    The Internal State of an LLM Knows When It’s Lying. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 967–976. https://doi.org/10.18653/v1/2023.findings-emnlp.68 Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, et al

  3. [3]

    INSIDE: LLMs’ internal states retain the power of hallucination detection.arXiv preprint arXiv:2402.03744(2024). T. Y . Chen, H. Leung, and I. K. Mak

  4. [4]

    Maher (Ed.)

    Higher-Level Decision Making, Michael J. Maher (Ed.). Springer, Berlin, Heidelberg, 320–329. https: //doi.org/10.1007/978-3-540-30502-6_23 David A Cohn, Zoubin Ghahramani, and Michael I Jordan

  5. [5]

    CShorten

    Active learning with statistical models.Journal of artificial intelligence research4 (1996), 129–145. CShorten

  6. [6]

    https://huggingface.co/datasets/CShorten/ Last-Week-on-ML-ArXiv

    arXiv Dataset: Hugging Face Dataset Card. https://huggingface.co/datasets/CShorten/ Last-Week-on-ML-ArXiv. Accessed: 2025-09-11. Jos de Jong

  7. [7]

    https://jsoneditoronline.org/ indepth/parse/fix-json/

    JSON Repair: How to fix JSON and validate it with ease. https://jsoneditoronline.org/ indepth/parse/fix-json/. Accessed: 2025-09-11. Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart

  8. [8]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al

    The mahalanobis distance.Chemometrics and intelligent laboratory systems50, 1 (2000), 1–18. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:cs.CL/2501.12948 https://arxiv.org/abs/ 2501.12948 16 CLOTHO: Measuring Task-Specific Pre-Generation Test Adequacy for LLM InputsA PREPRINT Sebastian G. Elbaum, Alexey G. Malishevsky, and Gregg Rothermel

  10. [10]

    InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.)

    Unknown Intent Detection Using Gaussian Mixture Model with an Application to Zero-shot Intent Classification. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 1050–1060. https://doi.org/10.1865...

  11. [11]

    Christiane Fellbaum

    Detecting hallucinations in large language models using semantic entropy.Nature630, 8017 (2024), 625–630. Christiane Fellbaum. 1998.WordNet: An electronic lexical database. MIT press. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, et al

  12. [12]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:cs.CL/2101.00027https://arxiv.org/abs/2101.00027 Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al

  13. [13]

    The Llama 3 Herd of Models

    The Llama 3 Herd of Models. arXiv:cs.AI/2407.21783https://arxiv.org/abs/2407.21783 Masato Hagiwara and Masato Mita

  14. [14]

    GitHub typo corpus: A large-scale multilingual dataset of misspellings and grammatical errors.arXiv preprint arXiv:1911.12893(2019). Jakob D. Havtorn, Jes Frellsen, Søren Hauberg, and Lars Maaløe

  15. [15]

    InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.)

    LLM Factoscope: Uncovering LLMs’ Factual Discernment through Measuring Inner States. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 10218–10230.https://doi.org/10.18653/v1/2024.findings-acl.608 Pengcheng He, Xiaodong ...

  16. [16]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654(2020). Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, et al

  17. [17]

    Albert Q

    Look before you leap: An exploratory study of uncertainty analysis for large language models.IEEE Transactions on Software Engineering (2025). Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, et al

  18. [18]

    Mistral 7B

    Mistral 7B. arXiv:cs.CL/2310.06825https://arxiv.org/abs/2310.06825 Jinhan Kim, Robert Feldt, and Shin Yoo

  19. [19]

    In Proceedings of the 41th International Conference on Software Engineering (ICSE 2019)

    Guiding Deep Learning System Testing using Surprise Adequacy. In Proceedings of the 41th International Conference on Software Engineering (ICSE 2019). IEEE Press, 1039–1049. Jinhan Kim, Robert Feldt, and Shin Yoo

  20. [20]

    Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo

    Evaluating Surprise Adequacy for Deep Learning System Testing.ACM Transactions on Software Engineering and Methodology32, 2 (June 2022), 1–29. Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo

  21. [21]

    InProceedings of ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE Industry Track) (ESEC/FSE 2020)

    Reducing DNN Labelling Cost using Surprise Adequacy: An Industrial Case Study for Autonomous Driving. InProceedings of ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE Industry Track) (ESEC/FSE 2020). 1466–1476. Jin K. Kim, Michael Chua, Mandy Rickard, and Armando Lorenzo

  22. [22]

    Seah Kim and Shin Yoo

    ChatGPT and Large Language Model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine.Journal of Pediatric Urology19, 5 (2023), 598–604. Seah Kim and Shin Yoo

  23. [23]

    InProceedings of the 2nd ACM/IEEE International Conference on Automated Software Testing (AST 2021)

    Multimodal Surprise Adequacy Analysis of Inputs for Natural Language Processing DNN Models. InProceedings of the 2nd ACM/IEEE International Conference on Automated Software Testing (AST 2021). Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, et al

  24. [24]

    Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

    Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927(2024). LangChain

  25. [25]

    InAdvances in Neural Information Process- ing Systems, volume 37, pages 83091–83118

    The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. arXiv:cs.CL/2303.03915 https://arxiv. org/abs/2303.03915 17 CLOTHO: Measuring Task-Specific Pre-Generation Test Adequacy for LLM InputsA PREPRINT Hokyung Lee, Sumanyu Sharma, and Bing Hu

  26. [26]

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin

    Bug in the code stack: Can llms find bugs in large python code stacks.arXiv preprint arXiv:2406.15325(2024). Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin

  27. [27]

    A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

    A Simple Unified Frame- work for Detecting Out-of-Distribution Samples and Adversarial Attacks. InAdvances in Neu- ral Information Processing Systems (NeurIPS 2018). https://papers.neurips.cc/paper/ 7947-a-simple-unified-framework-for-detecting-out-of-distribution-samples-and-adversarial-attacks. pdfAlso available as arXiv:1807.03888. Lei Ma, Felix Juefei...

  28. [28]

    InProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018)

    DeepGauge: Multi-granularity Testing Criteria for Deep Learning Systems. InProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). ACM, New York, NY , USA, 120–131. Laurens van der Maaten and Geoffrey Hinton

  29. [29]

    Potsawee Manakul, Adian Liusie, and Mark JF Gales

    Visualizing data using t-SNE.Journal of machine learning research9, Nov (2008), 2579–2605. Potsawee Manakul, Adian Liusie, and Mark JF Gales

  30. [30]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint arXiv:2303.08896(2023). Roger Mitton et al

  31. [31]

    Birkbeck spelling error corpus.Oxford Text Archive Legacy Collection(1980). F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, et al

  32. [32]

    Journal of Machine Learning Research12 (2011), 2825–2830

    Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research12 (2011), 2825–2830. Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana

  33. [33]

    InProceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017)

    DeepXplore: Automated Whitebox Testing of Deep Learning Systems. InProceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017). 1–18. https://doi.org/10.1145/3132747.3132785 PromptFoo

  34. [34]

    ACM computing surveys (CSUR)54, 9 (2021), 1–40

    A survey of deep active learning. ACM computing surveys (CSUR)54, 9 (2021), 1–40. Pat Rondon, Renyao Wei, José Cambronero, Jürgen Cito, Aaron Sun, et al

  35. [35]

    In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)

    Evaluating Agent-Based Program Repair at Google. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 365–376.https://doi.org/10.1109/ICSE-SEIP66354.2025.00038 Tobias Schnabel and Jennifer Neville

  36. [36]

    Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, and Benjamin Zorn

    Symbolic prompt program search: A structure-aware approach to efficient compile-time prompt optimization.arXiv preprint arXiv:2404.02319(2024). Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, and Benjamin Zorn

  37. [37]

    Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, et al

    PromptPex: Automatic Test Generation for Language Model Prompts.arXiv preprint arXiv:2503.05070(2025). Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, et al

  38. [38]

    InProceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)

    A Gold Standard Dependency Corpus for English. InProceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). Ben Snyder, Marius Moisescu, and Muhammad Bilal Zafar

  39. [39]

    On Early Detection of Hallucinations in Factual Question Answering. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024, Barcelona, Spain, August 25-29, 2024, Ricardo Baeza-Yates and Francesco Bonchi (Eds.). ACM, 2721–2732.https://doi.org/10.1145/3637528.3671796 Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, ...

  40. [40]

    InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.)

    Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 14379–14391.https://doi.org/10.18653/v1/2024.findings-acl.854 Gemma ...

  41. [41]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma 2: Improving Open Language Models at a Practical Size. arXiv:cs.CL/2408.00118https://arxiv.org/abs/2408.00118 Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray

  42. [42]

    Advances in neural information processing systems30 (2017)

    Attention is all you need. Advances in neural information processing systems30 (2017). Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, et al

  43. [43]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-Consistency Improves Chain of Thought Reasoning in Language Models.CoRRabs/2203.11171 (2023). John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, et al

  44. [44]

    InProceedings of the 16th IEEE International Conference on Software Testing, Verification and Validation (ICST 2024)

    Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents. InProceedings of the 16th IEEE International Conference on Software Testing, Verification and Validation (ICST 2024). 129–139. Xiang Zhang, Junbo Zhao, and Yann LeCun

  45. [45]

    InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024)

    AutoCodeRover: Autonomous Program Improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024). Association for Computing Machinery, New York, NY , USA, 1592–1604.https://doi.org/10. 1145/3650212.3680384 Shide Zhou, Tianlin Li, Kailong Wang, Yihao Huang, Ling Shi, et al

  46. [46]

    In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE)

    Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks . In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 730–742.https://doi.org/10.1109/ICSE55347.2025.00209 19