
arxiv: 2605.30018 · v2 · pith:F4EY2RBXnew · submitted 2026-05-28 · 💻 cs.CL · cs.LG

Latent Performance Profiling of Large Language Models

Pith reviewed 2026-06-29 07:28 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords latent performance profiling · large language models · intrinsic evaluation · hidden activations · output distributions · model signatures · benchmark alternatives

The pith

Latent Performance Profiling extracts task-agnostic signatures from LLM hidden activations and output distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues for moving beyond benchmark accuracy scores when evaluating large language models because those scores suffer from data contamination, narrow scope, and poor alignment with real reliability. It introduces Latent Performance Profiling as a complementary method that computes scalar metrics directly from a model's internal hidden activations and output distributions. These metrics are presented as stable across models of similar size yet sensitive to architectural differences. Tests across eight models ranging from 0.5B to 14B parameters show that models posting similar benchmark numbers can still differ markedly in properties such as entropy and adaptability. The resulting profiles are intended to support more reliable model selection and safety assessment.

Core claim

LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. In empirical analyses across eight LLMs spanning 0.5B to 14B parameters, models with similar benchmark scores exhibit contrasting latent profiles, such as differences in entropy or adaptability.
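
As a rough illustration of the output-distribution side of such a profile, the sketch below computes the mean next-token Shannon entropy from a matrix of logits. The helper name `mean_token_entropy`, the natural-log units, and the plain mean over positions are assumptions made for illustration; this page does not give the paper's exact entropy definition or aggregation.

```python
import numpy as np

def mean_token_entropy(logits):
    """Mean Shannon entropy (in nats) of softmax(logits) across positions.

    logits: array-like of shape (n_positions, vocab_size).
    Hypothetical helper for illustration; not the paper's reference code.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    # H = -sum_i p_i log p_i, computed per position and then averaged.
    entropy = -(probs * log_probs).sum(axis=-1)
    return float(entropy.mean())
```

Under this definition a uniform distribution over a vocabulary of size V gives log V, and a sharply peaked distribution gives a value near zero, so lower values indicate more confident predictions.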

What carries the argument

Latent Performance Profiling (LPP), a framework that derives task-agnostic scalar metrics from hidden activations and output distributions to characterize latent representations and dynamics.
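
The figure captions on this page name participation ratio (PR) and effective rank (ER) as two of the activation-based metrics. A minimal sketch of both follows, assuming the commonly used definitions: PR as the squared sum of covariance eigenvalues divided by the sum of their squares, and ER as the exponential of the entropy of the normalized singular values. Whether the paper uses exactly these definitions is not stated here, and the function names are hypothetical.

```python
import numpy as np

def _centered(acts):
    """Return activations of shape (n_samples, hidden_dim) with the per-feature mean removed."""
    acts = np.asarray(acts, dtype=np.float64)
    return acts - acts.mean(axis=0, keepdims=True)

def participation_ratio(acts):
    """PR = (sum_i lam_i)^2 / sum_i lam_i^2, with lam_i the covariance eigenvalues.

    The eigenvalues are proportional to the squared singular values of the
    centered activations; the proportionality constant cancels in the ratio.
    """
    s = np.linalg.svd(_centered(acts), compute_uv=False)
    lam = s ** 2
    denom = (lam ** 2).sum()
    return float(lam.sum() ** 2 / denom) if denom > 0 else 0.0

def effective_rank(acts):
    """ER = exp(H(p)), where p_i = s_i / sum_j s_j over the singular values s."""
    s = np.linalg.svd(_centered(acts), compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]  # drop exact zeros so log(p) stays finite
    return float(np.exp(-(p * np.log(p)).sum()))
```

Both quantities lie between 1 and the number of non-degenerate directions: activations confined to a single direction score about 1, while activations spread evenly over k orthogonal directions score about k.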

If this is right

  • Models posting similar benchmark scores can still be separated by their latent profiles in entropy or adaptability.
  • Synthetic probes for uncertainty and symbolic reasoning can be constructed to match LPP metrics while remaining decoupled from leaderboard data.
  • Reporting LPP values together with benchmark scores supplies a deeper basis for model selection and safety assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LPP signatures might indicate how a model will behave on entirely new tasks outside current benchmarks.
  • The approach could help flag internal inconsistencies that signal data contamination even when benchmark accuracy appears high.
  • Characteristic LPP patterns tied to model families might guide selection for deployment contexts where adaptability or calibrated uncertainty matters most.

Load-bearing premise

Metrics taken from hidden activations and output distributions are task-agnostic and capture meaningful scale-independent traits that remain independent of benchmark performance and data contamination.

What would settle it

Finding that the same model produces substantially different LPP metric values when tested on varied but semantically related tasks, or that LPP differences fail to predict outcomes on new synthetic probes for uncertainty and symbolic reasoning.
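
One way to make this test concrete is to compute a single metric for the same model under several input regimes and measure how far the values spread. The sketch below uses the coefficient of variation for that purpose; the helper names, the choice of spread statistic, and the 5% tolerance are all assumptions for illustration rather than a criterion taken from the paper.

```python
import numpy as np

def relative_spread(metric_by_regime):
    """Coefficient of variation (std / |mean|) of one metric across input regimes.

    metric_by_regime: dict mapping a regime label to a scalar metric value
    measured for the same model. Hypothetical helper for illustration.
    """
    values = np.array(list(metric_by_regime.values()), dtype=np.float64)
    mean = values.mean()
    std = values.std()
    if mean == 0:
        return float("inf") if std > 0 else 0.0
    return float(std / abs(mean))

def is_stable(metric_by_regime, tolerance=0.05):
    """True if the relative spread is within the tolerance (an arbitrary threshold)."""
    return relative_spread(metric_by_regime) <= tolerance
```

A metric whose spread stays within the tolerance across semantically related regimes would be consistent with the task-agnostic claim; a large spread would suggest the signature depends on the inputs chosen.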

Figures

Figures reproduced from arXiv: 2605.30018 by Amlan Chakrabarti, Ayan Sengupta, Lipika Dey, Mayank Vatsa, Partha Pratim Chakrabarti, Partha Pratim Das, Richa Singh, Suparna Bhattacharya, Supratik Chakraborty, Tanmoy Chakraborty.

Figure 1: Our proposed framework – Latent Performance Profiling (LPP), aims to uncover differences in LLM behavior that extrinsic benchmarks may miss. While traditional evaluations might suggest similar capabilities across models, LPP probes their internal activations and representations to reveal deeper distinctions. (A) We highlight instances where three LLMs are evaluated on four downstream tasks, across two task…
Figure 2: Extrinsic benchmark performance, latent profiling, and correlation with new LPP-driven tasks. This figure summarizes the core empirical findings of our study. (A) Comparison of model performance across three widely used extrinsic benchmarks (MMLU-PRO, BBH, and IFEval), revealing size-dependent trends and inconsistency across tasks – larger models do not uniformly outperform smaller ones. (B) LPP metrics – …
Figure 3: Comparison of (A) extrinsic, (B) intrinsic, and (C) synthetic metrics for LLM ranking. (A) MMLU-PRO vs. IFEval shows inconsistent rankings across models (e.g., Llama-3B vs. Qwen-3B), with a colluded top-right cluster indicating poor separability under extrinsic evaluation. (B) Intrinsic metrics (entropy, participation ratio, effective rank) clearly separate models and offer actionable guidance for selectin…
Figure 4: Sensitivity of LPP metrics on (A) prefix length, and (B) context length. Across variations in context and prefix, entropy, participation ratio, and effective rank exhibit smooth and architecture-consistent behavior, establishing them as robust intrinsic indicators of internal model dynamics. Entropy reliably tracks contextual uncertainty, while PR and ER capture representational spread and dimensionality c…
Figure 1: Sensitivity of intrinsic metrics across sample size (A) and dataset (B). We use dataset sample sizes of {10,100,500,1000} and three different language modeling datasets – Alpaca, Dolly, and WikiText. Detailed results shown in Figures 2 and 3, respectively. Each bar shows the mean entropy, PR, and ER across models and datasets. The trends demonstrate that while dataset-specific scaling slightly shifts the a…
Figure 2: Stability of LPP metrics across varying sample sizes. This figure presents entropy (left), participation ratio (center), and effective rank (right) computed over increasing sample sizes (10, 100, 500, 1000) from the Alpaca dataset for different LLMs. Each line denotes a different model. Entropy values (left) increase and then stabilize for all models, indicating early convergence in output uncertainty. The…
Figure 3: Comparison of LPP metrics – entropy, PR, and ER for different LLMs calibrated using three distinct calibration datasets: Alpaca, Dolly, and WikiText. Each subplot shows how a given metric varies across models for each dataset, using 100 samples per dataset. The top panel illustrates entropy trends, where models like Qwen-14B and Llama-3B show low entropy across datasets, indicating stable and confident pre…
Figure 4: Layerwise sensitivity of LPP metrics – Participation Ratio (left) and Effective Rank (right) across normalized depth for a range of language models. Each curve corresponds to a different model, with layer depth normalized from input (0.0) to output (1.0). Across all models, both PR and ER display a distinctive “hourglass” profile: high values at the initial layers, dropping to a pronounced minimum in the m…
Figure 5: LPP metrics for different LLMs using five aggregation schemes: (a) entropy minimum, PR maximum, ER maximum; (b) all metrics aggregated via median; (c) all metrics aggregated via mean; (d) all metrics aggregated via minimum; (e) all metrics aggregated via maximum. Bars represent the aggregated score contribution from each metric (entropy, participation ratio, effective rank), revealing how summary statistic…
Figure 6: Comparative analysis of how the number of in-context examples (0, 1, 5, and 10) influences the performance of various LLMs on synthetic tasks: AR and SPC. (a) illustrates model accuracy on AR tasks, showing a general trend of improved performance with an increasing number of in-context examples. (b) displays the F1 scores for SPC tasks, indicating more substantial performance gains for larger models like Q…
Original abstract

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP) -- a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend that reporting LPP alongside benchmarks provides a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Latent Performance Profiling (LPP), a framework deriving task-agnostic scalar metrics (e.g., entropy, adaptability) from LLMs' hidden activations and output distributions. It positions LPP as complementary to benchmarks, claiming these metrics yield stable, architecture-sensitive, scale-independent signatures. Empirical results across eight models (0.5B–14B parameters) are said to show that models with similar benchmark scores can have contrasting latent profiles, and the work proposes synthetic probes for uncertainty and symbolic reasoning aligned with the intrinsic metrics.

Significance. If the LPP metrics can be shown to be robustly task-agnostic and independent of input choice and benchmark contamination, the approach would offer a valuable intrinsic complement to leaderboard evaluations, enabling better detection of hidden vulnerabilities and more reliable model selection. The reported ability to distinguish models with matched benchmark performance via latent signatures would be a concrete strength if empirically substantiated.

major comments (1)
  1. [Abstract] The central claim that LPP metrics are task-agnostic and reveal intrinsic, scale-independent traits independent of benchmarks rests on the unverified assumption that the derived scalars remain stable under different input regimes. Because any activation-based metric requires feeding the model some distribution of inputs, the absence of any description of input sampling, variation across regimes, or invariance tests means the signatures could be artifacts of the chosen inputs rather than intrinsic properties; this directly undermines the asserted contrast with benchmark-centric evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern regarding input sampling and invariance below, and will revise the manuscript to strengthen the presentation of the task-agnostic claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that LPP metrics are task-agnostic and reveal intrinsic, scale-independent traits independent of benchmarks rests on the unverified assumption that the derived scalars remain stable under different input regimes. Because any activation-based metric requires feeding the model some distribution of inputs, the absence of any description of input sampling, variation across regimes, or invariance tests means the signatures could be artifacts of the chosen inputs rather than intrinsic properties; this directly undermines the asserted contrast with benchmark-centric evaluation.

    Authors: We agree that the absence of explicit details on input sampling and invariance testing in the abstract (and methods) leaves the task-agnostic claim open to the interpretation raised. The current manuscript does not report variation across input regimes or formal invariance tests. In the revised version we will add a dedicated subsection describing the input distribution used for activation collection and include new experiments that quantify metric stability under altered input regimes (for example, domain-restricted prompt sets). These changes will directly substantiate the intrinsic nature of the signatures. revision: yes

Circularity Check

0 steps flagged

No circularity: LPP introduced as empirical framework without self-referential definitions or fitted predictions

Full rationale

The provided abstract and description present LPP as a new set of scalar metrics extracted from hidden activations and output distributions, with claims of task-agnosticism and scale-independence supported by empirical analyses across models. No equations, parameter-fitting steps, or self-citations are shown that would reduce any metric to its own inputs by construction. The derivation chain consists of proposing diagnostics and demonstrating contrasts with benchmarks, remaining self-contained as an observational proposal rather than a closed definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on abstract; central claim rests on domain assumptions about latent states providing independent diagnostics.

axioms (1)
  • domain assumption: Hidden activations and output distributions contain task-agnostic information about model capabilities and vulnerabilities.
    Invoked when defining LPP metrics as revealing scale-independent traits.
invented entities (1)
  • Latent Performance Profiling (LPP) · no independent evidence
    purpose: Framework for deriving intrinsic scalar metrics from latent representations.
    Newly introduced method in the paper.


