On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Etienne Casanova; Rafal Kocielnik; R. Michael Alvarez

arXiv: 2606.00467 · v1 · pith:MHLOUFP4new · submitted 2026-05-30 · 💻 cs.CL · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-06-28 19:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · stat.ML

keywords LLM adaptability · zero-shot annotation · task definition alignment · memorization metrics · decision stickiness · toxicity detection · prompt correction · model priors

The pith

LLMs follow misaligned task definitions with unchanged confidence and resist prompt correction for most zero-shot errors, with performance tied to definition-specific familiarity rather than memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how LLMs' internalized knowledge shapes their behavior in zero-shot annotation tasks such as toxicity detection across varied domains. It establishes that nearly two-thirds of initial errors cannot be fixed through added prompting information, and that models apply misaligned definitions without any drop in confidence. A new metric, Definition-Specific Familiarity, shows a positive link to performance after dataset controls, whereas three memorization measures show none. This matters for anyone relying on LLMs for reliable labeling because it shows prompt adjustments have narrow reach when priors conflict with instructions.

Core claim

We find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate of only 34.8%. High-confidence errors prove especially resistant. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. After controlling for dataset-level confounds, Definition-Specific Familiarity shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics all fail to show a positive association.
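The rescue rate quoted above is a conditional proportion: among items the model got wrong zero-shot, the share it gets right once the prompt carries extra information. A minimal sketch follows, assuming per-item correctness flags are available for both conditions; the function name and the toy data are hypothetical and do not come from the paper.

```python
from typing import Sequence


def rescue_rate(zero_shot_correct: Sequence[bool],
                prompted_correct: Sequence[bool]) -> float:
    """Estimate P(correct when prompted | wrong in zero-shot)."""
    if len(zero_shot_correct) != len(prompted_correct):
        raise ValueError("inputs must be aligned item by item")
    # Keep only the items the model got wrong in the zero-shot condition.
    outcomes = [p for z, p in zip(zero_shot_correct, prompted_correct) if not z]
    if not outcomes:
        raise ValueError("no zero-shot errors to rescue")
    return sum(outcomes) / len(outcomes)


# Toy flags, invented for illustration only.
zero_shot = [True, False, False, True, False, False]
prompted = [True, True, False, False, False, True]
print(rescue_rate(zero_shot, prompted))  # 2 of 4 zero-shot errors rescued -> 0.5
```

Items that were correct zero-shot and become wrong after prompting (the fourth item here) do not enter the rescue rate at all, so a pipeline audit would likely want to track that regression rate separately.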

What carries the argument

Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the supplied task definition.

If this is right

Additional prompting corrects only about one-third of zero-shot annotation errors overall.
High-confidence errors remain especially resistant to rescue by prompt changes.
Models apply misaligned task definitions at the same confidence level as aligned ones.
Performance differences track definition alignment more than text-level memorization.
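To see whether high-confidence errors are harder to rescue, one option is to bin zero-shot errors by the confidence the model reported and compute the rescue rate per bin. The sketch below assumes confidences in [0, 1] and equal-width bins; the bin count, function name, and toy values are illustrative assumptions, not the paper's settings.

```python
from collections import defaultdict


def rescue_by_confidence(errors, n_bins=5):
    """Group zero-shot errors by confidence bin.

    errors: iterable of (zero_shot_confidence, rescued) pairs, one per
    zero-shot error. Returns {bin_index: (rescue_rate, count)}.
    """
    buckets = defaultdict(list)
    for conf, rescued in errors:
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Confidence 1.0 falls into the last bin rather than a new one.
        idx = min(int(conf * n_bins), n_bins - 1)
        buckets[idx].append(bool(rescued))
    return {i: (sum(v) / len(v), len(v)) for i, v in sorted(buckets.items())}


# Toy errors, invented so that mid-confidence errors are the most rescuable.
toy_errors = [(0.05, False), (0.15, False), (0.45, True), (0.55, True),
              (0.50, False), (0.95, False), (0.99, False)]
table = rescue_by_confidence(toy_errors)
for idx, (rate, count) in table.items():
    print(f"bin {idx}: rescue rate {rate:.2f} over {count} errors")
```

Reporting the count alongside each rate matters because the tails of the confidence distribution are often sparse, and a rate computed over a handful of errors is easy to over-read.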

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Task-specific model selection could improve if DSF can be estimated quickly without running full datasets.
Annotation pipelines may need pre-screening steps that check concept alignment instead of depending on prompt iteration.
The same resistance pattern could limit reliability in LLM-as-a-judge setups outside toxicity detection.
Pretraining data choices that shape clear internal concepts may matter more for downstream annotation than volume of text alone.

Load-bearing premise

That Definition-Specific Familiarity validly measures alignment between a model's internal concept and the task definition, and that partial correlation after dataset controls isolates this effect without residual confounds.
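Because the premise turns on what "controlling for dataset-level confounds" means, a concrete version helps. The sketch below assumes the simplest possible control, dataset identity: it subtracts each dataset's mean from both variables and correlates the residuals, which equals the partial correlation given dataset indicator variables. The controls actually used in the paper are not stated in the material summarized here, and all names and numbers are hypothetical.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def demean_within(values, groups):
    """Subtract each group's mean from its members."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_corr_given_dataset(dsf, perf, dataset):
    """Correlation of DSF and performance after removing dataset means."""
    return pearson(demean_within(dsf, dataset), demean_within(perf, dataset))


# Toy data, invented: the easier dataset B has lower DSF but higher scores,
# so the raw correlation is negative while the within-dataset one is positive.
dsf = [0.8, 0.9, 1.0, 0.2, 0.3, 0.4]
perf = [0.50, 0.55, 0.60, 0.80, 0.85, 0.90]
dataset = ["A", "A", "A", "B", "B", "B"]
print("raw r:", pearson(dsf, perf))
print("partial r:", partial_corr_given_dataset(dsf, perf, dataset))
```

The toy case shows why the choice of controls is load-bearing: the sign of the association flips depending on whether dataset differences are removed, and any confound that varies within a dataset would survive this particular adjustment.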

What would settle it

A new set of annotation tasks where the partial correlation of DSF with performance falls to zero or below after the same dataset-level controls, or where one memorization metric shows a clear positive link.

Figures

Figures reproduced from arXiv: 2606.00467 by Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez.

**Figure 1.** Study overview and research questions. Left: zero-shot annotation setup: given an input text and a user-provided task definition, the LLM produces a label prediction. Right: we study three facets of the interaction between model-internalized task concepts and user instructions. RQ1 tests whether task familiarity correlates with performance, contrasting definition alignment (Definition-Specific Familiarity;…

**Figure 2.** Confidence elicitation prompt suffix appended to all classification prompts. LABEL is replaced with the dataset-specific positive class (e.g., toxic, hateful, offensive). The base prompt varies by condition (zero-shot, definition, few-shot, misaligned, etc.). This template is identical across all models.

… the Rescue Rate:

Rescue Rate = P(Correct | Prompted, Zero-Shot Wrong)    (1)

This metric captures th…

**Figure 3.** Rescue probability vs. zero-shot confidence for zero-shot errors. The inverted-U shape exhibits two distinct failure modes (analyzed in Confidence and Decision Stickiness): decision stickiness at high confidence (right tail) and an out-of-distribution effect at very low confidence (left tail).

4.3. Answering RQ3: Misalignment Impact. We analyze 6 misaligned definition conditions to investigate how LLMs resp…

**Figure 4.** Calibration curves showing confidence vs. actual accuracy across conditions. All conditions exhibit overconfidence (below the diagonal), with no meaningful separation between aligned and misaligned definitions: models cannot distinguish when they are applying incorrect instructions.

Across all models and datasets, definition choice produces…
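The calibration curves in Figure 4 can be approximated from per-item confidences and correctness flags. The following is a minimal sketch, assuming equal-width confidence bins and the usual expected calibration error (ECE) summary; the binning scheme, function names, and toy values are assumptions and may differ from what the authors used.

```python
def calibration_table(confidences, correct, n_bins=10):
    """Return (mean_confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))
    rows = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(ok for _, ok in items) / len(items)
            rows.append((mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted mean gap between confidence and accuracy."""
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


# Toy predictions, invented: the model reports high confidence but is
# right only half the time, so every bin sits below the diagonal.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [True, True, False, False, True, False]
rows = calibration_table(confidences, correct)
for mean_conf, accuracy, count in rows:
    print(f"confidence {mean_conf:.2f} -> accuracy {accuracy:.2f} (n={count})")
print("ECE:", round(expected_calibration_error(rows), 3))
```

Running the same table separately for aligned and misaligned definition conditions would give a direct check on the "no meaningful separation" reading of the figure.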

Original abstract

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds only 34.8% of zero-shot annotation errors can be rescued by prompting and introduces DSF as a correlate of performance where memorization metrics fail, but the abstract gives almost no operational details on how either result was produced.

The main takeaway is that prompt-based fixes have hard limits here: nearly two-thirds of initial errors in toxicity annotation stay put even after correction attempts, high-confidence mistakes are especially resistant, and models will adopt misaligned definitions without dropping their confidence. The new DSF metric is presented as capturing alignment between a model's internal concept and the supplied definition, and it shows a partial correlation of +0.41 with performance after dataset controls while ROUGE-L, BERTScore, and embedding similarity show none.

What the work actually does is run the same annotation task across multiple domains and both dense and mixture-of-experts models, then directly compare correction success against definition misalignment and against standard text-overlap measures. That contrast is the clearest contribution; it moves past generic accuracy numbers and tests whether familiarity with the exact definition matters more than surface-level memorization.

The soft spots are concentrated in the missing mechanics. The abstract states the rescue rate, the unchanged confidence under misalignment, and the partial r result but supplies no formula for DSF, no list of the dataset-level controls, no description of how the three memorization metrics were computed on the same items, and no mention of error bars or sample sizes. Without those, the claim that partial correlation isolates alignment rather than some residual confound cannot be checked. The stress-test concern about DSF construction is therefore still live on the basis of what is written.

This is for people who run LLM annotation or LLM-as-judge pipelines and want concrete numbers on when prompting stops helping. A methods-focused reader would get value from the multi-model, multi-domain setup and the specific percentages even if the mechanism needs more work. It is worth sending to peer review because the questions are practical and the measurements are specific, though any review would have to press hard for the operational definitions and controls before the central association can be trusted.

Referee Report

3 major / 2 minor

Summary. The manuscript examines interactions between LLM-internalized priors and user-provided task definitions in zero-shot annotation, focusing on toxicity detection across multiple datasets and model types. It reports that about 65% of zero-shot errors resist correction by additional prompting (overall rescue rate 34.8%), that high-confidence errors are especially sticky, that models follow misaligned definitions while preserving confidence levels, and that a newly introduced metric, Definition-Specific Familiarity (DSF), exhibits a positive partial correlation (partial r = +0.41) with performance after dataset-level controls, unlike three memorization metrics (ROUGE-L, BERTScore, embedding cosine similarity).

Significance. If the DSF association and rescue-rate findings hold after full methodological disclosure, the work would usefully document prompt-correction limits and the relative importance of definition alignment versus surface memorization for annotation reliability, providing a concrete empirical basis for prioritizing definition-model fit in LLM-as-a-judge pipelines.

major comments (3)

[Abstract / Methods] Abstract and Methods: the partial-correlation claim (r = +0.41) between DSF and performance is load-bearing for the central argument that definition alignment matters more than memorization, yet the abstract supplies neither the operational definition of DSF nor the exact list of dataset-level control variables; without these, it is impossible to verify that the reported association isolates internalized-concept alignment rather than residual confounds or construction artifacts.
[Results] Results on decision stickiness: the 34.8% rescue rate and the claim that high-confidence errors are especially resistant are presented without error bars, exact prompt templates, or statistical controls for dataset difficulty; these omissions make it difficult to assess whether the two-thirds resistance figure generalizes or is driven by particular dataset properties.
[Experiments] Misaligned-definition experiment: the finding that models follow misaligned definitions with unchanged confidence requires explicit reporting of how alignment/misalignment was operationalized and how confidence was measured across conditions; absent these details the result cannot be evaluated for robustness against prompt phrasing artifacts.
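The request for error bars on the rescue rate could be met in several ways. One simple option is a percentile bootstrap over the zero-shot errors, sketched below. The item-level resampling, the function name, and the synthetic flags are assumptions; a fuller analysis would probably resample at the dataset or model level to respect clustering.

```python
import random


def bootstrap_ci(rescued, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 rescue flags."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(rescued)
    stats = []
    for _ in range(n_boot):
        # Resample the zero-shot errors with replacement.
        sample = [rescued[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic flags, invented for illustration: 35 rescued out of 100 errors.
flags = [1] * 35 + [0] * 65
point = sum(flags) / len(flags)
lower, upper = bootstrap_ci(flags)
print(f"rescue rate {point:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

With only 100 errors the interval is wide, which is the practical point behind the referee's comment: a single headline rate says little about how stable the two-thirds figure is across datasets.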

minor comments (2)

[Abstract] The abstract states 'three distinct memorization metrics' but does not list the precise implementations or preprocessing steps used for ROUGE-L, BERTScore, and embedding cosine; adding these would aid reproducibility.
[Data] Dataset descriptions mention 'social media, gaming, news, and forums' but provide no table of sizes, label distributions, or inter-annotator agreement; a supplementary table would strengthen the claim that dataset-level confounds were adequately controlled.
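To make the reproducibility point concrete, here is a minimal sketch of two of the three named memorization measures: a ROUGE-L F1 score built on the longest common subsequence, and cosine similarity between two vectors. Whitespace tokenization, lowercasing, the balanced F1 weighting, and the function names are all assumptions; the paper's preprocessing and embedding model are not given in the abstract, and BERTScore is omitted because it needs a pretrained model.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical model continuation compared against the true dataset text.
print(rouge_l_f1("a b c d", "a x c y"))  # LCS is "a c": P = R = 0.5 -> 0.5
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors -> 0.0
```

Small choices such as casing, tokenization, and which span of the text is compared can shift these scores noticeably, which is why the minor comment asks for the exact implementations.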

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed review and recommendations for improving the clarity and reproducibility of our work. We address each major comment below and will make the suggested revisions to enhance methodological transparency.

Referee: [Abstract / Methods] Abstract and Methods: the partial-correlation claim (r = +0.41) between DSF and performance is load-bearing for the central argument that definition alignment matters more than memorization, yet the abstract supplies neither the operational definition of DSF nor the exact list of dataset-level control variables; without these, it is impossible to verify that the reported association isolates internalized-concept alignment rather than residual confounds or construction artifacts.

Authors: We agree that the abstract and Methods section should provide the operational definition of DSF and the specific dataset-level controls used in the partial correlation analysis. In the revised manuscript, we will explicitly define DSF and list the control variables to allow verification that the association reflects internalized-concept alignment. revision: yes

Referee: [Results] Results on decision stickiness: the 34.8% rescue rate and the claim that high-confidence errors are especially resistant are presented without error bars, exact prompt templates, or statistical controls for dataset difficulty; these omissions make it difficult to assess whether the two-thirds resistance figure generalizes or is driven by particular dataset properties.

Authors: We will include error bars on all reported rescue rates, provide the exact prompt templates used in an appendix, and add statistical controls for dataset difficulty in the revised Results section. revision: yes

Referee: [Experiments] Misaligned-definition experiment: the finding that models follow misaligned definitions with unchanged confidence requires explicit reporting of how alignment/misalignment was operationalized and how confidence was measured across conditions; absent these details the result cannot be evaluated for robustness against prompt phrasing artifacts.

Authors: We will expand the description of the misaligned-definition experiment to detail the operationalization of alignment and misalignment (including example definitions) and specify the method for measuring and comparing confidence levels across conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical associations with newly introduced metric

The paper is an empirical study reporting experimental results on zero-shot annotation performance, decision stickiness (rescue rate 34.8%), and a new metric DSF. It states a partial correlation (partial r = +0.41) after dataset-level controls and contrasts it with null results for three memorization metrics. No equations, derivations, or self-citations are present that reduce the reported statistics or the DSF-performance link to quantities defined by the paper's own fitted parameters or inputs. The central claims rest on experimental measurements and controls rather than any self-definitional, fitted-input, or self-citation chain. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the validity of the new DSF metric and the adequacy of dataset-level controls; no free parameters are explicitly fitted in the abstract, but the partial correlation itself depends on the modeling choice of which confounds to partial out.

axioms (1)

domain assumption: Partial correlation after removing dataset-level confounds isolates the effect of definition alignment from other dataset properties.
Invoked when the abstract reports the partial r = +0.41 result for DSF.

invented entities (1)

Definition-Specific Familiarity (DSF) · no independent evidence
purpose: Quantifies alignment between a model's internalized concept and the supplied task definition.
Newly introduced metric whose construction is not detailed in the abstract; it is marked as having no independent evidence because no external falsifiable prediction is given.

Reference graph

Works this paper leans on

52 extracted references · 36 canonical work pages · 7 internal anchors

[1]

o ttger, P., Urman, A., Wendsj \

Baumann, J., R \"o ttger, P., Urman, A., Wendsj \"o , A., Plaza-del Arco, F. M., Gruber, J. B., and Hovy, D. Large language model hacking: Quantifying the hidden risks of using LLMs for text annotation. arXiv preprint arXiv:2509.08825, 2025. doi:10.48550/arXiv.2509.08825. URL https://arxiv.org/abs/2509.08825

work page doi:10.48550/arxiv.2509.08825 2025
[2]

Nuanced metrics for measuring unintended bias with real data for text classification

Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, WWW '19, pp.\ 491--500, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450366755. doi:10.1145/3308560.3317593. URL ht...

work page doi:10.1145/3308560.3317593 2019
[3]

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

1901
[4]

Extracting training data from large language models

Carlini, N., Tram \`e r, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, \'U ., Oprea, A., and Raffel, C. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp.\ 2633--2650. USENIX Association, August 2021. ISBN 978-1-939133-24-3. URL https:/...

2021
[5]

A course correction in steerability evaluation: Revealing miscalibration and side effects in LLM s

Chang, T., Schnabel, T., Swaminathan, A., and Wiens, J. A course correction in steerability evaluation: Revealing miscalibration and side effects in LLM s. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026. URL https://arxiv.org/abs/2505.23816

work page arXiv 2026
[6]

Toxic comment classification challenge

cjadams, Sorensen, J., Elliott, J., Dixon, L., McDonald, M., nithum, and Cukierski, W. Toxic comment classification challenge. Kaggle Competition, 2018. URL https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

2018
[7]

Automated hate speech detection and the problem of offensive language

Davidson, T., Warmsley, D., Macy, M., and Weber, I. Automated hate speech detection and the problem of offensive language. Proceedings of the International AAAI Conference on Web and Social Media, 11 0 (1): 0 512--515, 2017. doi:10.1609/icwsm.v11i1.14955

work page doi:10.1609/icwsm.v11i1.14955 2017
[8]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report, 2024. URL https://arxiv.org/abs/2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Y., and Belding, E

ElSherief, M., Kulkarni, V., Nguyen, D., Wang, W. Y., and Belding, E. Hate lingo: A target-based linguistic analysis of hate speech in social media. Proceedings of the International AAAI Conference on Web and Social Media, 12 0 (1), June 2018. doi:10.1609/icwsm.v12i1.15041. URL https://ojs.aaai.org/index.php/ICWSM/article/view/15041

work page doi:10.1609/icwsm.v12i1.15041 2018
[10]

Detecting

Gao, L. and Huang, R. Detecting online hate speech using context aware models. In Mitkov, R. and Angelova, G. (eds.), Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017 , pp.\ 260--266, Varna, Bulgaria, September 2017. INCOMA Ltd. doi:10.26615/978-954-452-049-6_036. URL https://aclanthology.org/R17-1036/

work page doi:10.26615/978-954-452-049-6_036 2017
[11]

A survey of confidence estimation and calibration in large language models

Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., and Gurevych, I. A survey of confidence estimation and calibration in large language models. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers...

work page doi:10.18653/v1/2024.naacl-long.366 2024
[12]

ChatGPT outperforms crowd workers for text-annotation tasks , volume=

Gilardi, F., Alizadeh, M., and Kubli, M. Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120 0 (30): 0 e2305016120, 2023. doi:10.1073/pnas.2305016120. URL https://www.pnas.org/doi/abs/10.1073/pnas.2305016120

work page doi:10.1073/pnas.2305016120 2023
[13]

and Surdeanu, M

Golchin, S. and Surdeanu, M. Time travel in LLM s: Tracing data contamination in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2Rwq6c3tvr

2024
[14]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Han, P., Kocielnik, R., Song, P., Debnath, R., Mobbs, D., Anandkumar, A., and Alvarez, R. M. The personality illusion: Revealing dissociation between self-reports & behavior in LLM s. arXiv preprint arXiv:2509.03730, 2025. URL https://arxiv.org/abs/2509.03730

work page arXiv 2025
[16]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Mixtral of Experts

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Bou Hanna, E., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Le Scao, T., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and El Sayed, W. Mixtra...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., and Potts, C. Dspy: Compiling declarative language model calls into self-improving pipelines. ArXiv, abs/2310.03714, 2023. URL https://api.semanticscholar.org/CorpusID:263671701

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Prometheus: Inducing fine-grained evaluation capability in language models

Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., and Seo, M. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8euJaTveKw

2024
[20]

Kocielnik, R., Li, Z., Kann, C., Sambrano, D., Morrier, J., Linegar, M., Taylor, C., Kim, M., Naqvie, N., Soltani, F., Dehpanah, A., Cahill, G., Anandkumar, A., and Alvarez, R. M. Challenges in moderating disruptive player behavior in online competitive action games. Frontiers in Computer Science, 6: 0 1283735, 2024. doi:10.3389/fcomp.2024.1283735. URL ht...

work page doi:10.3389/fcomp.2024.1283735 2024
[21]

A., Soltani, F., Sambrano, D., Anandkumar, A., and Alvarez, R

Kocielnik, R., Kim, M., Boonyarungsrit, P. A., Soltani, F., Sambrano, D., Anandkumar, A., and Alvarez, R. M. Prosocial behavior detection in player game chat: From aligning human- AI definitions to efficient annotation at scale. arXiv preprint arXiv:2508.05938, 2025 a . URL https://arxiv.org/abs/2508.05938

work page arXiv 2025
[22]

Kocielnik, R., Li, Z., Linegar, M., Sambrano, D., Soltani, F., Kim, M., Naqvie, N., Cahill, G., Anandkumar, A., and Alvarez, R. M. Online moderation in competitive action games: How intervention affects player behaviors. Proc. ACM Hum.-Comput. Interact., 9 0 (6), October 2025 b . doi:10.1145/3748599. URL https://doi.org/10.1145/3748599

work page doi:10.1145/3748599 2025
[23]

ROUGE : A package for automatic evaluation of summaries

Lin, C.-Y. ROUGE : A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp.\ 74--81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

2004
[24]

G -eval: NLG evaluation using gpt-4 with better human alignment

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G -eval: NLG evaluation using gpt-4 with better human alignment. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 2511--2522, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v...

work page doi:10.18653/v1/2023.emnlp-main.153 2023
[25]

Llama 3.3 model card

Meta AI . Llama 3.3 model card. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct, December 2024

2024
[26]

M., Varshney, K

Miehling, E., Desmond, M., Natesan Ramamurthy, K., Daly, E. M., Varshney, K. R., Farchi, E., Dognin, P., Rios, J., Bouneffouf, D., Liu, M., and Sattigeri, P. Evaluating the prompt steerability of large language models. In Chiruzzo, L., Ritter, A., and Wang, L. (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associa...

work page doi:10.18653/v1/2025.naacl-long.400 2025
[27]

Rethinking the role of demonstrations: What makes in-context learning work? In Goldberg, Y., Kozareva, Z., and Zhang, Y

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 11048--11064, Abu Dhabi, United Arab Emirat...

work page doi:10.18653/v1/2022.emnlp-main.759 2022
[28]

Mistral-small-24b-instruct-2501

Mistral AI . Mistral-small-24b-instruct-2501. https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501, 2025

2025
[29]

S em E val-2016 task 6: Detecting stance in tweets

Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., and Cherry, C. S em E val-2016 task 6: Detecting stance in tweets. In Bethard, S., Carpuat, M., Cer, D., Jurgens, D., Nakov, P., and Zesch, T. (eds.), Proceedings of the 10th International Workshop on Semantic Evaluation ( S em E val-2016) , pp.\ 31--41, San Diego, California, June 2016. Association for...

work page doi:10.18653/v1/s16-1003 2016
[30]

Nallapati, R., Zhou, B., Gulcehre, C., and Xiang, B

Nadeem, M., Bethke, A., and Reddy, S. S tereo S et: Measuring stereotypical bias in pretrained language models. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp....

work page doi:10.18653/v1/2021.acl-long.416 2021
[31]

Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. C row S -pairs: A challenge dataset for measuring social biases in masked language models. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 1953--1967, Online, November 2020. Association for Comput...

work page doi:10.18653/v1/2020.emnlp-main.154 2020
[32]

B., Thapa, S., and Zhang, Q

Naseem, U., Shiwakoti, S., Shah, S. B., Thapa, S., and Zhang, Q. G ame T ox: A comprehensive dataset and analysis for enhanced toxicity detection in online gaming communities. In Chiruzzo, L., Ritter, A., and Wang, L. (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human L...

work page doi:10.18653/v1/2025.naacl-short.37 2025
[33]

GPT-4o mini : Advancing cost-efficient intelligence

OpenAI . GPT-4o mini : Advancing cost-efficient intelligence. OpenAI Blog, 2024. URL https://openai.com/blog/gpt-4o-mini-advancing-cost-efficient-intelligence

2024
[34]

and Lee, L

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics ( ACL -04) , pp.\ 271--278, Barcelona, Spain, July 2004. doi:10.3115/1218955.1218990. URL https://aclanthology.org/P04-1035/

work page doi:10.3115/1218955.1218990 2004
[35]

M., and Bowman, S

Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. R. BBQ : A hand-built bias benchmark for question answering. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.\ 2086--2105, Dublin, Ireland, May 2022. Association for Comput...

work page doi:10.18653/v1/2022.findings-acl.165 2022
[36]

and Holmes, C

Pawitan, Y. and Holmes, C. Confidence in the Reasoning of Large Language Models . Harvard Data Science Review, 7 0 (1), January 2025. https://hdsr.mitpress.mit.edu/pub/jaqt0vpb

2025
[37]

The ``problem'' of human label variation: On ground truth in data, modeling and evaluation

Plank, B. The ``problem'' of human label variation: On ground truth in data, modeling and evaluation. In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 10671--10682, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:...

work page doi:10.18653/v1/2022.emnlp-main.731 2022
[38]

Qwen2.5 Technical Report

Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., and Hashimoto, T. Whose opinions do language models reflect? In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 29971–30004. ... 2023.
[40]

Sap, M., Swayamdipta, S., Vianna, L., Zhou, X., Choi, Y., and Smith, N. A. Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Carpuat, M., de Marneffe, M.-C., and Meza Ruiz, I. V. (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human L...

doi:10.18653/v1/2022.naacl-main.431
[41]

Shaikh, O., Zhang, H., Held, W., Bernstein, M., and Yang, D. On second thought, let's not think step by step! Bias and toxicity in zero-shot reasoning. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4454–4470, Toronto, Canada, ...

doi:10.18653/v1/2023.acl-long.244
[42]

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=zWqr3MQuNs
[43]

Tan, S., Agrawal, L. A., Singhvi, A., Lai, L., Ryan, M. J., Klein, D., Khattab, O., Sen, K., and Zaharia, M. LangProBe: a language program benchmark. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 21489–21509, Suzhou, China, November 2025. Associa...

doi:10.18653/v1/2025.findings-emnlp.1172
[44]

Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C. D. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. pp. 5433–5442, December 2023. doi:10.18653/v1/2023.emnlp-main.330. URL https://aclanthology.org/2023.emnlp-main.330/
[45]

Van Hee, C., Lefever, E., and Hoste, V. SemEval-2018 task 3: Irony detection in English tweets. In Apidianaki, M., Mohammad, S. M., May, J., Shutova, E., Bethard, S., and Carpuat, M. (eds.), Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 39–50, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. ...

doi:10.18653/v1/s18-1005
[46]

Weld, H., Huang, G., Lee, J., Zhang, T., Wang, K., Guo, X., Long, S., Poon, J., and Han, C. CONDA: a CONtextual dual-annotated dataset for in-game toxicity understanding and detection. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2406–2416, Online, August 2021. As...

doi:10.18653/v1/2021.findings-acl.213
[47]

Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjeQKFxFpZ
[48]

Yang, D., Tsai, Y.-H. H., and Yamada, M. On verbalized confidence scores for LLMs. arXiv preprint arXiv:2412.14737, 2024. URL https://arxiv.org/abs/2412.14737
[49]

Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In May, J., Shutova, E., Herbelot, A., Zhu, X., Apidianaki, M., and Mohammad, S. M. (eds.), Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 75–8...

doi:10.18653/v1/s19-2010
[50]

Zhang, J., Yu, S., Chong, D., Sicilia, A., Tomz, M., Manning, C. D., and Shi, W. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=9jQkmGunGo
[51]

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=SkeHuCVFDr
[52]

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=uccHPGDlao
