Recognition: 1 theorem link · Lean theorem
HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
Pith reviewed 2026-05-13 23:01 UTC · model grok-4.3
The pith
HumorRank ranks language models on humor generation through automated joke tournaments, finding that skill with comedic mechanisms, not model size, drives the rankings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HumorRank is a tournament-based evaluation framework and leaderboard that performs automated pairwise comparisons of LLM-generated humor using judgments grounded in the General Theory of Verbal Humor, aggregates those results via an Adaptive Swiss tournament, and derives globally consistent rankings through Bradley-Terry Maximum Likelihood Estimation. The resulting statistically grounded model stratifications show that humor quality depends on mastery of comedic mechanisms rather than model scale.
What carries the argument
The HumorRank tournament system, which converts GTVH-grounded pairwise judgments into global rankings through Adaptive Swiss scheduling and Bradley-Terry MLE.
Load-bearing premise
Automated pairwise judgments based on the General Theory of Verbal Humor accurately capture true humor quality without systematic bias.
What would settle it
A direct comparison study in which human raters evaluate the same model outputs: if the human-derived model rankings differ substantially from HumorRank's, the load-bearing premise fails.
Original abstract
Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.
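The abstract's aggregation step is standard machinery: Bradley-Terry MLE turns a matrix of pairwise win counts into a vector of relative strengths. A minimal sketch of the usual minorization-maximization (MM) fit follows; `bradley_terry_mle` and the toy win matrix are illustrative, not the paper's implementation.

```python
import numpy as np

def bradley_terry_mle(wins, iters=1000, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j.
    Standard minorization-maximization (MM) update; assumes the
    comparison graph is connected so the MLE exists.
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = total_wins / denom
        p_new /= p_new.sum()  # strengths are only defined up to scale
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Toy example: three models, model 0 clearly strongest.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry_mle(wins))  # relative strengths, summing to 1
```

The ranking is just the ordering of the fitted strengths; overlapping strengths from noisy win rates are what the leaderboard's stratification step has to resolve.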
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation in LLMs. Using the SemEval-2026 MWAHAHA test dataset, it conducts automated pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) across nine models, aggregates outcomes via an Adaptive Swiss tournament, and applies Bradley-Terry MLE to produce globally consistent rankings. The central claim is that these rankings yield statistically grounded stratifications demonstrating that humor quality is driven by mastery of comedic mechanisms rather than model scale alone.
Significance. If the automated judgments prove reliable, HumorRank would provide a valuable, scalable methodology for unified benchmarking of LLM humor generation, replacing isolated incomparable metrics with interpretable global rankings. The GTVH grounding and Bradley-Terry aggregation offer a theoretically motivated and reproducible approach that could help track progress and identify key drivers of humor capability.
Major comments (2)
- Evaluation pipeline (abstract and methods): The automated GTVH-based pairwise judgments lack any reported human validation, inter-annotator agreement metrics, or bias audit. This is load-bearing for the headline claim that the stratifications show mechanism mastery (not scale) drives humor quality: systematic bias in the LLM judge, correlated with model family or size, could artifactually produce the observed inversion.
- Results section: No statistical significance tests for the scale-vs-mechanism finding, no confidence intervals on the Bradley-Terry parameters, and no sensitivity analysis over Adaptive Swiss tournament parameters are reported, leaving the assertion of 'statistically grounded model stratifications' under-supported.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which highlight important areas for strengthening the manuscript. We address each major point below and commit to revisions that enhance the rigor and transparency of our evaluation framework.
Point-by-point responses
- Referee: Evaluation pipeline (abstract and methods): The automated GTVH-based pairwise judgments lack any reported human validation, inter-annotator agreement metrics, or bias audit. This is load-bearing for the headline claim that the stratifications show mechanism mastery (not scale) drives humor quality: systematic bias in the LLM judge, correlated with model family or size, could artifactually produce the observed inversion.
Authors: We agree that human validation is essential to substantiate the reliability of the automated GTVH judgments and rule out potential biases. In the revised manuscript, we will add a dedicated validation subsection reporting results from a human study on a stratified sample of 300 pairwise comparisons. This will include inter-annotator agreement metrics (Cohen's kappa and Fleiss' kappa), a bias audit examining correlations between judgment errors and model family/size, and qualitative analysis of disagreement cases. These additions will directly support the claim that observed stratifications reflect genuine differences in comedic mechanism mastery rather than judge artifacts. Revision: yes.
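For concreteness, Cohen's kappa on such a validation sample reduces to comparing observed agreement against chance agreement between two label sequences. A minimal sketch, assuming each pairwise comparison is labeled with the winning side; the `llm` and `human` sequences below are invented toy data, not the proposed 300-comparison study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two equal-length label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # chance agreement from each rater's marginal label frequencies
    expected = sum(freq_a[lab] * freq_b[lab] for lab in labels) / n**2
    return (observed - expected) / (1 - expected)

# Toy check: LLM judge vs. one human on six pairwise comparisons.
llm = ["A", "A", "B", "B", "A", "B"]
human = ["A", "A", "B", "A", "A", "B"]
print(round(cohens_kappa(llm, human), 3))  # 0.667
```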
- Referee: Results section: No statistical significance tests for the scale-vs-mechanism finding, no confidence intervals on the Bradley-Terry parameters, and no sensitivity analysis over Adaptive Swiss tournament parameters are reported, leaving the assertion of 'statistically grounded model stratifications' under-supported.
Authors: We acknowledge that the current results section would be strengthened by explicit statistical support. In the revision, we will expand the results to include: bootstrap 95% confidence intervals on all Bradley-Terry parameters; statistical significance tests (Mann-Whitney U and permutation tests) comparing the mechanism-mastery grouping against scale-based groupings; and sensitivity analyses varying Adaptive Swiss parameters (e.g., round counts from 4 to 12), reporting Kendall tau rank stability across configurations. These will be presented with tables and figures to rigorously ground the reported stratifications. Revision: yes.
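A sketch of how the promised bootstrap intervals and rank-stability checks might look, assuming judgments are stored as (winner, loser) index pairs and reusing a Bradley-Terry fitter such as the MM sketch above; `bootstrap_bt_ci` is a hypothetical helper, not code from the paper, and real code would guard against degenerate resamples in which a model drops out entirely.

```python
import numpy as np
from scipy.stats import kendalltau

def bootstrap_bt_ci(judgments, n_models, fit_fn, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CIs for Bradley-Terry strengths.

    judgments: array of (winner, loser) index pairs;
    fit_fn: maps a win-count matrix to a strength vector.
    """
    rng = np.random.default_rng(seed)
    judgments = np.asarray(judgments)
    strengths = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(judgments), len(judgments))
        wins = np.zeros((n_models, n_models))
        for winner, loser in judgments[idx]:
            wins[winner, loser] += 1
        strengths.append(fit_fn(wins))  # assumes every model still appears
    lo, hi = np.percentile(strengths, [2.5, 97.5], axis=0)
    return lo, hi

# Rank stability across two tournament configurations via Kendall's tau:
tau, _ = kendalltau([1, 2, 3, 4], [1, 3, 2, 4])
print(round(tau, 2))  # 0.67: one adjacent swap out of six pairs
```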
Circularity Check
No circularity: HumorRank rankings derive from external GTVH judgments aggregated by standard MLE without self-referential reduction
Full rationale
The paper's derivation chain consists of (1) applying the external General Theory of Verbal Humor to generate automated pairwise judgments on the SemEval-2026 dataset, followed by (2) aggregation via Adaptive Swiss tournament and Bradley-Terry MLE to produce global rankings. No equations, self-citations, or ansatzes reduce the final stratifications or the claim that mechanism mastery (not scale) drives quality to the inputs by construction. The MLE step is a standard statistical aggregation of independent judgment data; GTVH supplies an external theoretical basis rather than a self-defined loop. Absence of human validation is a correctness risk but does not match any enumerated circularity pattern.
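Step (2)'s Adaptive Swiss scheduling is the only non-standard piece of that chain. The paper's exact pairing rule is not reproduced here; the sketch below is a generic Swiss-style round that pairs entries of similar running score while avoiding rematches, with `swiss_round` as an illustrative name, not the paper's algorithm.

```python
def swiss_round(models, scores, played):
    """One Swiss-style round: pair entries of similar running score.

    models: list of names; scores: dict name -> points so far;
    played: set of frozenset({a, b}) pairs already scheduled (mutated).
    """
    waiting = sorted(models, key=lambda m: scores[m], reverse=True)
    pairs = []
    while len(waiting) > 1:
        a = waiting.pop(0)
        # prefer the closest-scored opponent not yet faced
        for b in waiting:
            if frozenset((a, b)) not in played:
                break
        else:
            b = waiting[0]  # only rematches remain; take the closest score
        waiting.remove(b)
        played.add(frozenset((a, b)))
        pairs.append((a, b))
    return pairs  # with an odd field, the leftover entry gets a bye

scores = {"m1": 2, "m2": 2, "m3": 1, "m4": 0}
print(swiss_round(["m1", "m2", "m3", "m4"], scores, set()))
# [('m1', 'm2'), ('m3', 'm4')]
```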
Axiom & Free-Parameter Ledger
Free parameters (1)
- Bradley-Terry strength parameters
Axioms (1)
- Domain assumption: the General Theory of Verbal Humor provides valid, automatable criteria for judging relative humor quality.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tagged unclear
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Passage: "Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [2] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
- [3] Edward Ajayi and Prasenjit Mitra. HumorGen: Cognitive synergy for humor generation in large language models via persona-based distillation. https://huggingface.co/Jayi2424/HumorGen-7B, 2025a. Preprint.
- [4] Edward Ajayi and Prasenjit Mitra. Automatic humor detection: A comprehensive survey from theoretical foundations to large language models. December 2025b. doi:10.13140/RG.2.2.24393.61288. URL https://doi.org/10.13140/RG.2.2.24393.61288. Preprint.
- [5] Paul C. H. Albers and Han de Vries. Elo-rating as a tool in the sequential estimation of dominance strengths. Animal Behaviour, pp. 489–495, 2001.
- [6] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024. Model card.
- [7] Salvatore Attardo. The general theory of verbal humor. In The Routledge Handbook of Language and Humor, pp. 126–142. Routledge, 2017.
- [8] Salvatore Attardo. Linguistic Theories of Humor, volume 1. Walter de Gruyter GmbH & Co KG, 2024.
- [9] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [10] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4): 324–345, 1952.
- [11] Santiago Castro, Luis Chiruzzo, Santiago Góngora, Salar Rahili, Naihao Deng, Ignacio Sastre, Victoria Amoroso, Guillermo Rey, Aiala Rosá, Guillermo Moncecchi, J. A. Meaney, Juan José Prada, and Rada Mihalcea. SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate. In Proceedings of the 20th International Workshop on Semantic Evaluation, 2026.
- [12] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning, 2024.
- [13] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [14] Denis Federiakin. Improving LLM leaderboards with psychometrical methodology. arXiv preprint arXiv:2501.17200, 2025.
- [15] Mayank Goel, Parameswari Krishnamurthy, and Radhika Mamidi. Automating humor: A novel approach to joke generation using template extraction and infilling. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pp. 442–448, 2024.
- [16] Fabricio Goes, Zisen Zhou, Piotr Sawicki, Marek Grzes, and Daniel G. Brown. Crowd Score: A method for the evaluation of jokes using large language model AI voters as judges. arXiv preprint arXiv:2212.11214, 2022.
- [17] Drew Gorenz and Norbert Schwarz. How funny is ChatGPT? A comparison of human- and AI-produced jokes. PLOS ONE, 19(7): e0305364, 2024.
- [18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [19] Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Rada Mihalcea, and Naihao Deng. Chumor 2.0: Towards benchmarking Chinese humor understanding. arXiv preprint arXiv:2412.17729, 2024.
- [20] Zachary Horvitz, Jingru Chen, Rahul Aditya, Harshvardhan Srivastava, Robert West, Zhou Yu, and Kathleen McKeown. Getting serious about humor: Crafting humor datasets with unfunny large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 855–869, 2024.
- [21] Nabil Hossain, John Krumm, Michael Gamon, and Henry Kautz. SemEval-2020 Task 7: Assessing humor in edited news headlines. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 746–758, 2020.
- [22] Veedant Jain, Felipe dos Santos Alves Feitosa, and Gabriel Kreiman. Is AI fun? HumorDB: A curated dataset and benchmark to investigate graphical humor. arXiv preprint arXiv:2406.13564, 2024.
- [23] Sean Kim and Lydia B. Chilton. AI humor generation: Cognitive, social and creative skills for effective humor. arXiv preprint arXiv:2502.07981, 2025.
- [24] Cristina Larkin-Galiñanes. An overview of humor theory. In The Routledge Handbook of Language and Humor, pp. 4–16, 2017.
- [25] Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. Open-LLM-Leaderboard: From multi-choice to open-style questions for LLMs evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545, 2024.
- [26] Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine S. L. Dysart-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, and Lalit Jain. Which LLMs get the joke? Probing non-STEM reasoning abilities with HumorBench. arXiv preprint arXiv:2507.21476, 2025.
- [27] Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, and Hwalsuk Lee. Open Ko-LLM Leaderboard: Evaluating large language models in Korean with Ko-H5 benchmark. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3220–3234, 2024.
- [28] Kexin Quan, Pavithra Ramakrishnan, and Jessie Chin. Can AI take a joke—or make one? A study of humor generation and recognition in LLMs. In Proceedings of the 2025 Conference on Creativity and Cognition, pp. 431–437, 2025.
- [29] Sahithya Ravi, Patrick Huber, Akshat Shrivastava, Vered Shwartz, and Arash Einolghozati. Small but funny: A feedback-driven approach to humor distillation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13078–13090, 2024.
- [30] Adrianna Romanowski, Pedro H. V. Valois, and Kazuhiro Fukui. From punchlines to predictions: A metric to assess LLM performance in identifying humor in stand-up comedy. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pp. 36–46, 2025.
- [31] Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, and Roy Ka-Wei Lee. Humor in pixels: Benchmarking large multimodal models' understanding of online comics. arXiv preprint arXiv:2509.12248, 2025.
- [32] Mohammadamin Shafiei and Hamidreza Saffari. Not all jokes land: Evaluating large language models' understanding of workplace humor. arXiv preprint arXiv:2506.01819, 2025.
- [33] João Silva, Luís Gomes, and António Branco. CLARIN-PT-LDB: An open LLM leaderboard for Portuguese to assess language, culture and civility. arXiv preprint arXiv:2603.12872, 2026.
- [34] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [35] Changhao Song, Yazhou Zhang, Hui Gao, Ben Yao, and Peng Zhang. Large language models for subjective language understanding: A survey. arXiv preprint arXiv:2508.07959, 2025.
- [36] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
- [37] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- [38] Toloka Team. Understanding LLM leaderboards: Metrics, benchmarks, and why they matter, November 2023. URL https://toloka.ai/blog/llm-leaderboard/. Accessed: 2026-03-23.
- [39]
- [40] Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, and Hui Wang. Innovative thinking, infinite humor: Humor research of large language models through structured thought leaps, April 2025. URL http://arxiv.org/abs/2410.10370. arXiv:2410.10370 [cs].
- [41] Thomas Winters and Stijn Van der Stockt. Evaluating humor generation in an improvisational comedy setting. Computational Linguistics in the Netherlands Journal, 14: 505–523, 2025.
- [42] Shih-Hung Wu, Tsz-Yeung Lau, and Yu-Feng Huang. Humour classification according to genre and technique by fine-tuning LLMs. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 156–169. Springer, 2025a.
- [43] Zhikun Wu, Thomas Weber, and Florian Müller. One does not simply meme alone: Evaluating co-creativity between LLMs and humans in the generation of humor. In Proceedings of the 30th International Conference on Intelligent User Interfaces, pp. 1082–1092, 2025b.
- [44] Hiroaki Yamane. Generic joke generation with moral constraints. In International Conference on Artificial Neural Networks, pp. 340–355. Springer, 2024.
- [45] Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan L. Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, et al. Humor in AI: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning. Advances in Neural Information Processing Systems, 37: 125264–125286, 2024.
- [46] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36: 46595–46623, 2023.
- [47] Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. Let's think outside the box: Exploring leap-of-thought in large language models with creative humor generation. pp. 13246–13257, 2024. URL https://openaccess.thecvf.com/content/CVPR2024/html/Zhong_Lets_Think_Outside_the_Box_Exploring_Leap-of-Thought_i...
- [48] Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T. Rogers, Lalit K. Jain, Robert D. Nowak, Bob Mankoff, and Jifan Zhang. Bridging the creativity understanding gap: Small-scale human alignment enables expert-level humor ranking in LLMs. arXiv preprint arXiv:2502.20356, 2025.