pith. machine review for the scientific record.

arxiv: 2604.19786 · v1 · submitted 2026-03-31 · 💻 cs.CL

Recognition: 1 theorem link

· Lean Theorem

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

Edward Ajayi, Prasenjit Mitra


Pith reviewed 2026-05-13 23:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords: humor generation · LLM evaluation · tournament ranking · General Theory of Verbal Humor · pairwise comparison · Bradley-Terry model · model benchmarking · comedic mechanisms

The pith

HumorRank ranks language models on humor generation through automated joke tournaments, finding that skill with comedic mechanisms matters more than model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HumorRank as a tournament-based system to evaluate and compare how well large language models create humorous text. Pairwise matchups between model outputs draw on the General Theory of Verbal Humor for judgments, which are then organized through an Adaptive Swiss tournament structure and turned into overall rankings with Bradley-Terry statistical modeling. Testing across nine models from different categories produces clear stratifications that tie higher performance to better command of humor techniques. This replaces scattered individual metrics with one consistent leaderboard. The result supplies a repeatable way to measure and track advances in AI humor generation.
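The scheduling step can be made concrete. The paper's exact "Adaptive Swiss" rules are not reproduced on this page, so the sketch below shows only the standard Swiss idea it builds on: pair entrants with similar running scores while avoiding rematches. The model names and scores are hypothetical.

```python
def swiss_round(scores, played):
    """Pair entrants with similar running scores, avoiding rematches.

    scores: dict mapping entrant -> current score.
    played: set of frozenset pairs already matched in earlier rounds.
    Greedy one-round sketch; with an odd field, the last entrant
    would receive a bye (not handled here).
    """
    order = sorted(scores, key=scores.get, reverse=True)  # best first
    pairs, used = [], set()
    for m in order:
        if m in used:
            continue
        for opp in order:
            if opp != m and opp not in used and frozenset((m, opp)) not in played:
                pairs.append((m, opp))
                used.update((m, opp))
                break
    return pairs

# Hypothetical standings after two rounds; "a" and "b" already met,
# so the scheduler drops each down to the nearest fresh opponent.
scores = {"a": 2, "b": 2, "c": 1, "d": 1}
played = {frozenset(("a", "b"))}
pairs = swiss_round(scores, played)  # [("a", "c"), ("b", "d")]
```

The point of the Swiss structure is efficiency: strong generators quickly meet other strong generators, so far fewer than all-pairs comparisons are needed to separate the field.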

Core claim

HumorRank is a tournament-based evaluation framework and leaderboard that performs automated pairwise comparisons of LLM-generated humor using judgments grounded in the General Theory of Verbal Humor, aggregates those results via an Adaptive Swiss tournament, and derives globally consistent rankings through Bradley-Terry Maximum Likelihood Estimation, yielding statistically grounded model stratifications that show humor quality depends on mastery of comedic mechanisms rather than model scale.

What carries the argument

HumorRank tournament system that converts GTVH-grounded pairwise judgments into global rankings through Adaptive Swiss scheduling and Bradley-Terry MLE.
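As a rough illustration of the ranking step, here is a minimal Bradley-Terry fit using the classic MM (Zermelo) update on an invented three-model win matrix. This is a sketch of the standard estimator, not the paper's implementation.

```python
def bradley_terry_mle(wins, iters=2000, tol=1e-12):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of comparisons model i won against model j.
    Classic MM (Zermelo) update; strengths are identifiable only up
    to scale, so they are normalized to sum to 1 each round.
    """
    n = len(wins)
    p = [1.0] * n
    total = [[wins[i][j] + wins[j][i] for j in range(n)] for i in range(n)]
    for _ in range(iters):
        prev = p[:]
        for i in range(n):
            denom = sum(total[i][j] / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = sum(wins[i]) / denom  # total wins of i over expected games
        s = sum(p)
        p = [x / s for x in p]
        if max(abs(a - b) for a, b in zip(p, prev)) < tol:
            break
    return p

# Toy tournament among three hypothetical models:
# model 0 beats 1 (8-2) and beats 2 (9-1); model 1 beats 2 (6-4).
wins = [[0, 8, 9],
        [2, 0, 6],
        [1, 4, 0]]
strengths = bradley_terry_mle(wins)
ranking = sorted(range(3), key=lambda i: -strengths[i])  # strongest first
```

Under the fitted model, the probability that model i beats model j is p_i / (p_i + p_j), which is what makes the resulting leaderboard globally consistent rather than a patchwork of local win rates.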

Load-bearing premise

Automated pairwise judgments based on the General Theory of Verbal Humor accurately capture true humor quality without systematic bias.

What would settle it

A direct comparison study in which human raters evaluate the same model outputs: human rankings that agree with HumorRank would validate the automated judge, while rankings that differ substantially would undermine it.

Figures

Figures reproduced from arXiv: 2604.19786 by Edward Ajayi, Prasenjit Mitra.

Figure 1. HumorRank Leaderboard (left) and Pairwise Win-Rate Heatmap (right) showing the performance of the 9 models. Remarkably, the specialized HumorGen-7B model (Rank 4, BT = 1092.8) successfully bridges the gap between the mid-tier open-weights and the proprietary frontier, cleanly outperforming models an order of magnitude larger (e.g., GPT OSS 120B, Rank 6). [PITH_FULL_IMAGE:figures/full_fig_p005_1.png]

Figure 2. Per-model winning feature distributions (Llama 3.3 70B judge). Left: Humor mechanisms (% of wins). Right: Delivery features (% of wins). Frontier models dominate via Conciseness; the specialist model leads on Absurdity and Escalation; baseline models over-index on Wordplay.

Figure 3. Per-model failure mode distributions (Llama 3.3 70B judge). Cliché and Weak Punchline dominate most models, but HumorGen-7B stands out with markedly higher Overexplained (25.2%) and Buried Punchline (20.4%) rates than any other model—a byproduct of its deep-structure comedic strategy. Qwen 2.5 72B failure modes are in Appendix F. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

Figure 4. HumorRank Leaderboard (top) and Pairwise Win-Rate Heatmap (bottom) showing the performance of the 9 models. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png]

Figure 5. Four representative LLaMA judge decisions. Winner ✓ (green) and Loser × (red) are labelled directly on each joke box. Feature rows indicate winning humor traits (green), delivery strengths (blue), and loser weaknesses (red). ELO deltas are approximated from the evaluation log. [PITH_FULL_IMAGE:figures/full_fig_p018_5.png]

Figure 6. Prompt template used for all 10,800 pairwise comparisons. Template variables ({headline}, {joke a}, {joke b}) are instantiated per comparison. The three feature lists (humor mechanisms, delivery, and loser features) enforce structured and consistent JSON outputs across all evaluations. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png]

Figure 7. Per-model winning feature distributions (Qwen 2.5 72B judge). Left: Humor mechanisms. Right: Delivery features. Rank patterns are consistent with the primary Llama judge. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png]

Figure 8. Per-model failure mode distributions (Qwen 2.5 72B judge). HumorGen-7B again shows markedly higher Overexplained (49.5%) rates compared to other models, consistent with findings under the primary Llama judge. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png]

Figure 9. Instructions screen (HumorRank Blind Evaluation). [PITH_FULL_IMAGE:figures/full_fig_p023_9.png]

Figure 10. Sample evaluated pair showing the blind comparison interface. Evaluators see two anonymized jokes (Option A and Option B) for a given headline and select the funnier response. [PITH_FULL_IMAGE:figures/full_fig_p023_10.png]
Original abstract

Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation in LLMs. Using the SemEval-2026 MWAHAHA test dataset, it conducts automated pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) across nine models, aggregates outcomes via an Adaptive Swiss tournament, and applies Bradley-Terry MLE to produce globally consistent rankings. The central claim is that these rankings yield statistically grounded stratifications demonstrating that humor quality is driven by mastery of comedic mechanisms rather than model scale alone.

Significance. If the automated judgments prove reliable, HumorRank would provide a valuable, scalable methodology for unified benchmarking of LLM humor generation, replacing isolated incomparable metrics with interpretable global rankings. The GTVH grounding and Bradley-Terry aggregation offer a theoretically motivated and reproducible approach that could help track progress and identify key drivers of humor capability.

major comments (2)
  1. [Evaluation pipeline] Evaluation pipeline (abstract and methods): The automated GTVH-based pairwise judgments lack any reported human validation, inter-annotator agreement metrics, or bias audit. This is load-bearing for the headline claim that the stratifications show mechanism mastery (not scale) drives humor quality, as systematic bias in the LLM judge correlated with model family or size could artifactually produce the observed inversion.
  2. [Results section] Results section: No statistical significance tests for the scale-vs-mechanism finding, confidence intervals on the Bradley-Terry parameters, or sensitivity analysis to Adaptive Swiss tournament parameters are reported, leaving the assertion of 'statistically grounded model stratifications' under-supported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important areas for strengthening the manuscript. We address each major point below and commit to revisions that enhance the rigor and transparency of our evaluation framework.

Point-by-point responses
  1. Referee: Evaluation pipeline (abstract and methods): The automated GTVH-based pairwise judgments lack any reported human validation, inter-annotator agreement metrics, or bias audit. This is load-bearing for the headline claim that the stratifications show mechanism mastery (not scale) drives humor quality, as systematic bias in the LLM judge correlated with model family or size could artifactually produce the observed inversion.

    Authors: We agree that human validation is essential to substantiate the reliability of the automated GTVH judgments and rule out potential biases. In the revised manuscript, we will add a dedicated validation subsection reporting results from a human study on a stratified sample of 300 pairwise comparisons. This will include inter-annotator agreement metrics (Cohen's kappa and Fleiss' kappa), a bias audit examining correlations between judgment errors and model family/size, and qualitative analysis of disagreement cases. These additions will directly support the claim that observed stratifications reflect genuine differences in comedic mechanism mastery rather than judge artifacts. revision: yes

  2. Referee: Results section: No statistical significance tests for the scale-vs-mechanism finding, confidence intervals on the Bradley-Terry parameters, or sensitivity analysis to Adaptive Swiss tournament parameters are reported, leaving the assertion of 'statistically grounded model stratifications' under-supported.

    Authors: We acknowledge that the current results section would be strengthened by explicit statistical support. In the revision, we will expand the results to include: bootstrap 95% confidence intervals on all Bradley-Terry parameters; statistical significance tests (Mann-Whitney U and permutation tests) comparing the mechanism-mastery group against scale-based groupings; and sensitivity analyses varying Adaptive Swiss parameters (e.g., round count from 4-12 and reporting Kendall tau rank stability across configurations). These will be presented with tables and figures to rigorously ground the reported stratifications. revision: yes
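Both promised families of statistics have simple closed forms. A pure-Python sketch follows; the verdicts, rankings, and sample sizes below are invented for illustration, not drawn from the paper.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance
    return (p_o - p_e) / (1 - p_e)

def kendall_tau(a, b):
    """Kendall rank correlation between two rankings of the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) // 2)

# Hypothetical judge-vs-human verdicts on ten pairwise comparisons
# ("A" or "B" wins), and hypothetical nine-model rankings produced
# under two judge configurations.
judge = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
human = ["A", "A", "B", "B", "B", "B", "A", "A", "B", "A"]
kappa = cohens_kappa(judge, human)  # 0.8 on this toy data

rank_llama = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rank_qwen = [1, 3, 2, 4, 5, 7, 6, 8, 9]
tau = kendall_tau(rank_llama, rank_qwen)  # two adjacent swaps -> 32/36
```

The sensitivity analysis the rebuttal promises amounts to recomputing the Bradley-Terry ranking under each Swiss configuration and checking that tau between configurations stays near 1.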

Circularity Check

0 steps flagged

No circularity: HumorRank rankings derive from external GTVH judgments aggregated by standard MLE without self-referential reduction

Full rationale

The paper's derivation chain consists of (1) applying the external General Theory of Verbal Humor to generate automated pairwise judgments on the SemEval-2026 dataset, followed by (2) aggregation via Adaptive Swiss tournament and Bradley-Terry MLE to produce global rankings. No equations, self-citations, or ansatzes reduce the final stratifications or the claim that mechanism mastery (not scale) drives quality to the inputs by construction. The MLE step is a standard statistical aggregation of independent judgment data; GTVH supplies an external theoretical basis rather than a self-defined loop. Absence of human validation is a correctness risk but does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that GTVH supplies reliable criteria for automated pairwise humor judgments and on the statistical model that Bradley-Terry MLE produces globally consistent rankings from tournament data.

free parameters (1)
  • Bradley-Terry strength parameters
    Maximum likelihood estimation fits one strength parameter per model to the observed pairwise win rates.
axioms (1)
  • domain assumption General Theory of Verbal Humor provides valid, automatable criteria for judging relative humor quality
    Invoked to ground all pairwise comparisons in the tournament.

pith-pipeline@v0.9.0 · 5447 in / 1264 out tokens · 49409 ms · 2026-05-13T23:01:25.712854+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors


  2. [2]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

  3. [3]

    Humorgen: Cognitive synergy for humor generation in large language models via persona-based distillation

Edward Ajayi and Prasenjit Mitra. Humorgen: Cognitive synergy for humor generation in large language models via persona-based distillation. https://huggingface.co/Jayi2424/HumorGen-7B, 2025a. Preprint

  4. [4]

    Automatic humor detection: A comprehensive survey from theoretical foundations to large language models

Edward Ajayi and Prasenjit Mitra. Automatic humor detection: A comprehensive survey from theoretical foundations to large language models. December 2025b. doi:10.13140/RG.2.2.24393.61288. URL https://doi.org/10.13140/RG.2.2.24393.61288. Preprint

  5. [5]

    Elo-rating as a tool in the sequential estimation of dominance strengths

Paul CH Albers and Han de Vries. Elo-rating as a tool in the sequential estimation of dominance strengths. Animal Behaviour, pp. 489–495, 2001

  6. [6]

    The claude 3 model family: Opus, sonnet, haiku

Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024. Model card

  7. [7]

    The general theory of verbal humor

Salvatore Attardo. The general theory of verbal humor. In The Routledge handbook of language and humor, pp. 126–142. Routledge, 2017

  8. [8]

    Linguistic theories of humor, volume 1

    Salvatore Attardo. Linguistic theories of humor, volume 1. Walter de Gruyter GmbH & Co KG, 2024

  9. [9]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  10. [10]

    Rank analysis of incomplete block designs: I

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4): 324–345, 1952

  11. [11]

Santiago Castro, Luis Chiruzzo, Santiago Góngora, Salar Rahili, Naihao Deng, Ignacio Sastre, Victoria Amoroso, Guillermo Rey, Aiala Rosá, Guillermo Moncecchi, J. A. Meaney, Juan José Prada, and Rada Mihalcea. SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate. In Proceedings of the 20th International Workshop on Sem...

  12. [12]

    Chatbot arena: An open platform for evaluating llms by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  14. [14]

    Improving llm leaderboards with psychometrical methodology

    Denis Federiakin. Improving llm leaderboards with psychometrical methodology. arXiv preprint arXiv:2501.17200, 2025

  15. [15]

    Automating humor: A novel approach to joke generation using template extraction and infilling

Mayank Goel, Parameswari Krishnamurthy, and Radhika Mamidi. Automating humor: A novel approach to joke generation using template extraction and infilling. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pp. 442–448, 2024

  16. [16]

    Crowd score: A method for the evaluation of jokes using large language model ai voters as judges

    Fabricio Goes, Zisen Zhou, Piotr Sawicki, Marek Grzes, and Daniel G Brown. Crowd score: A method for the evaluation of jokes using large language model ai voters as judges. arXiv preprint arXiv:2212.11214, 2022

  17. [17]

    How funny is chatgpt? a comparison of human-and ai-produced jokes

Drew Gorenz and Norbert Schwarz. How funny is chatgpt? a comparison of human- and ai-produced jokes. PLoS ONE, 19(7): e0305364, 2024

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    Chumor 2.0: Towards benchmarking chinese humor understanding

    Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Rada Mihalcea, and Naihao Deng. Chumor 2.0: Towards benchmarking chinese humor understanding. arXiv preprint arXiv:2412.17729, 2024

  20. [20]

    Getting serious about humor: Crafting humor datasets with unfunny large language models

Zachary Horvitz, Jingru Chen, Rahul Aditya, Harshvardhan Srivastava, Robert West, Zhou Yu, and Kathleen McKeown. Getting serious about humor: Crafting humor datasets with unfunny large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 855–869, 2024

  21. [21]

    Semeval-2020 task 7: Assessing humor in edited news headlines

Nabil Hossain, John Krumm, Michael Gamon, and Henry Kautz. Semeval-2020 task 7: Assessing humor in edited news headlines. In Proceedings of the fourteenth workshop on semantic evaluation, pp. 746–758, 2020

  22. [22]

    Is ai fun? humordb: a curated dataset and benchmark to investigate graphical humor

    Veedant Jain, Felipe dos Santos Alves Feitosa, and Gabriel Kreiman. Is ai fun? humordb: a curated dataset and benchmark to investigate graphical humor. arXiv preprint arXiv:2406.13564, 2024

  23. [23]

    Ai humor generation: Cognitive, social and creative skills for effective humor

    Sean Kim and Lydia B Chilton. Ai humor generation: Cognitive, social and creative skills for effective humor. arXiv preprint arXiv:2502.07981, 2025

  24. [24]

    An overview of humor theory

Cristina Larkin-Galiñanes. An overview of humor theory. The Routledge handbook of language and humor, pp. 4–16, 2017

  25. [25]

    Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena

    Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545, 2024

  26. [26]

    Which llms get the joke? probing non-stem reasoning abilities with humorbench

    Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine SL Dysart-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, and Lalit Jain. Which llms get the joke? probing non-stem reasoning abilities with humorbench. arXiv preprint arXiv:2507.21476, 2025

  27. [27]

    Open ko-llm leaderboard: Evaluating large language models in korean with ko-h5 benchmark

Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, and Hwalsuk Lee. Open ko-llm leaderboard: Evaluating large language models in korean with ko-h5 benchmark. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3220–3234, 2024

  28. [28]

    Can ai take a joke—or make one? a study of humor generation and recognition in llms

Kexin Quan, Pavithra Ramakrishnan, and Jessie Chin. Can ai take a joke—or make one? a study of humor generation and recognition in llms. In Proceedings of the 2025 Conference on Creativity and Cognition, pp. 431–437, 2025

  29. [29]

    Small but funny: A feedback-driven approach to humor distillation

Sahithya Ravi, Patrick Huber, Akshat Shrivastava, Vered Shwartz, and Arash Einolghozati. Small but funny: A feedback-driven approach to humor distillation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13078–13090, 2024

  30. [30]

    From punchlines to predictions: A metric to assess llm performance in identifying humor in stand-up comedy

Adrianna Romanowski, Pedro HV Valois, and Kazuhiro Fukui. From punchlines to predictions: A metric to assess llm performance in identifying humor in stand-up comedy. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pp. 36–46, 2025

  31. [31]

    Humor in pixels: Benchmarking large multimodal models understanding of online comics

    Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, and Roy Ka-Wei Lee. Humor in pixels: Benchmarking large multimodal models understanding of online comics. arXiv preprint arXiv:2509.12248, 2025

  32. [32]

    Not all jokes land: Evaluating large language models understanding of workplace humor

    Mohammadamin Shafiei and Hamidreza Saffari. Not all jokes land: Evaluating large language models understanding of workplace humor. arXiv preprint arXiv:2506.01819, 2025

  33. [33]

    Clarin-pt-ldb: An open llm leaderboard for portuguese to assess language, culture and civility

João Silva, Luís Gomes, and António Branco. Clarin-pt-ldb: An open llm leaderboard for portuguese to assess language, culture and civility. arXiv preprint arXiv:2603.12872, 2026

  34. [34]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  35. [35]

    Large language models for subjective language understanding: A survey

    Changhao Song, Yazhou Zhang, Hui Gao, Ben Yao, and Peng Zhang. Large language models for subjective language understanding: A survey. arXiv preprint arXiv:2508.07959, 2025

  36. [36]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  37. [37]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  38. [38]

    Understanding llm leaderboards: Metrics, benchmarks, and why they matter, November 2023

Toloka Team. Understanding llm leaderboards: Metrics, benchmarks, and why they matter, November 2023. URL https://toloka.ai/blog/llm-leaderboard/. Accessed: 2026-03-23

  39. [39]

    A theory of humor

    Thomas C Veatch. A theory of humor. 1998

  40. [40]

Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps, April 2025

Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, and Hui Wang. Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps, April 2025. URL http://arxiv.org/abs/2410.10370. arXiv:2410.10370 [cs]

  41. [41]

    Evaluating humor generation in an improvisational comedy setting

Thomas Winters and Stijn Van der Stockt. Evaluating humor generation in an improvisational comedy setting. Computational Linguistics in the Netherlands Journal, 14: 505–523, 2025

  42. [42]

    Humour classification according to genre and technique by fine-tuning llms

Shih-Hung Wu, Tsz-Yeung Lau, and Yu-Feng Huang. Humour classification according to genre and technique by fine-tuning llms. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 156–169. Springer, 2025a

  43. [43]

    One does not simply meme alone: Evaluating co-creativity between llms and humans in the generation of humor

Zhikun Wu, Thomas Weber, and Florian Müller. One does not simply meme alone: Evaluating co-creativity between llms and humans in the generation of humor. In Proceedings of the 30th International Conference on Intelligent User Interfaces, pp. 1082–1092, 2025b

  44. [44]

    Generic joke generation with moral constraints

Hiroaki Yamane. Generic joke generation with moral constraints. In International Conference on Artificial Neural Networks, pp. 340–355. Springer, 2024

  45. [45]

    Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning

Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan L Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, et al. Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning. Advances in Neural Information Processing Systems, 37: 125264–125286, 2024

  46. [46]

    Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 46595–46623, 2023

  47. [47]

Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation

Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. pp. 13246–13257, 2024. URL https://openaccess.thecvf.com/content/CVPR2024/html/Zhong_Lets_Think_Outside_the_Box_Exploring_Leap-of-Thought_i...

  48. [48]

    Bridging the creativity understanding gap: Small-scale human alignment enables expert-level humor ranking in llms

    Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, and Jifan Zhang. Bridging the creativity understanding gap: Small-scale human alignment enables expert-level humor ranking in llms. arXiv preprint arXiv:2502.20356, 2025