HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation
Pith reviewed 2026-05-15 08:38 UTC · model grok-4.3
The pith
Cognitive personas synthesizing humor data let a 7B model match or beat much larger LLMs at comedy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Cognitive Synergy Framework deploys six cognitive personas through a Mixture-of-Thought approach to synthesize a high-quality, diverse humor dataset; fine-tuning a 7B student model on this data produces performance that significantly exceeds larger instruction-tuned models and competes with state-of-the-art proprietary models, establishing that cognitive-driven data curation outweighs both model scale and alignment methods such as DPO or the introduced O-GRPO.
What carries the argument
Mixture-of-Thought (MoT) deployment of six cognitive personas that each generate a distinct comedic perspective on a given prompt to create the training data.
If this is right
- A 7B model trained on this data can exceed larger instruction-tuned models in humor generation tasks.
- Cognitive data curation delivers larger gains than switching between alignment algorithms such as DPO and O-GRPO.
- The same framework can reduce dependence on model scale for tasks that require incongruity.
- Offline Group Relative Policy Optimization serves as a viable alternative alignment method when paired with the curated data.
Where Pith is reading between the lines
- The same six-persona structure could be adapted to improve LLM performance on other tasks that depend on surprise, such as creative writing or riddle generation.
- Psychological theories of humor provide a reusable template for designing data-synthesis pipelines in other subjective domains.
- The approach implies that targeted data quality can substitute for scale in narrow creative capabilities, which would be testable by applying the method to smaller models on new tasks.
- Extending the personas to additional humor styles or cultural contexts would likely increase output diversity without requiring larger models.
Load-bearing premise
The humor examples produced by the six cognitive personas through Mixture-of-Thought form a higher-quality and more useful training signal than data created by standard methods.
What would settle it
Training the identical 7B model on the same volume of humor data generated without any cognitive personas and observing no improvement over the baselines would show the personas add no value.
Figures
read the original abstract
Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective - predicting the most likely next word - inherently conflicts with the surprise and incongruity needed for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a theoretically grounded methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework creates a theoretically grounded dataset, which we use to fine-tune a 7B-parameter student model. We compare Direct Preference Optimization (DPO) and a novel Offline Group Relative Policy Optimization (O-GRPO); our 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models. We find that cognitive-driven data curation is far more critical than alignment algorithms or model scale for humor generation. Code and data will be available upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Cognitive Synergy Framework for humor generation in LLMs. It employs a Mixture-of-Thought (MoT) approach deploying six cognitive personas (e.g., The Absurdist, The Cynic) drawn from psychological humor theories to synthesize diverse training data from prompts. This dataset is used to fine-tune a 7B student model, which is aligned via either Direct Preference Optimization (DPO) or a proposed Offline Group Relative Policy Optimization (O-GRPO). The central claims are that the resulting 7B model significantly outperforms larger instruction-tuned baselines and matches state-of-the-art proprietary models, with the conclusion that cognitive-driven data curation matters more than alignment method or model scale.
Significance. If the performance claims and the primacy of cognitive curation are substantiated by rigorous ablations and transparent evaluation, the work would demonstrate that theory-grounded data synthesis can enable compact models to rival much larger systems on creative tasks. This would shift emphasis in the field toward psychologically motivated curation pipelines rather than scale alone, with potential applicability to other incongruity-driven domains.
major comments (3)
- [Abstract] Abstract: The headline claim that 'cognitive-driven data curation is far more critical than alignment algorithms or model scale' is unsupported because the manuscript contains no ablation that trains identical 7B models on MoT persona data versus data generated by standard single-prompt or random humor synthesis under the same alignment procedure. Without this isolation, observed gains cannot be attributed to the six-persona Mixture-of-Thought process rather than dataset size, prompt details, or evaluation artifacts.
- [Abstract] Abstract and Evaluation section: Performance claims are stated without specifying the concrete metrics (human preference scores, automatic humor metrics, etc.), training and test set sizes, number of evaluation prompts, or any statistical tests (e.g., paired t-tests or bootstrap confidence intervals). This omission makes it impossible to judge whether the reported outperformance over larger baselines is robust or reproducible.
- [Results] Results section: The comparison to 'larger instruction-tuned baselines' and 'state-of-the-art proprietary models' lacks explicit model identifiers, parameter counts, and the precise prompt distribution or task formulation used for head-to-head evaluation, preventing assessment of whether the 7B model truly generalizes or benefits from evaluation-specific artifacts.
minor comments (1)
- [Abstract] Abstract: The novel O-GRPO algorithm is named but not briefly characterized (e.g., how the group-relative objective differs from standard DPO), which would help readers immediately grasp its contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional clarity and controls are needed to strengthen the attribution of our results to the Cognitive Synergy Framework. We address each major comment below and commit to revisions that will improve the manuscript's rigor and reproducibility.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claim that 'cognitive-driven data curation is far more critical than alignment algorithms or model scale' is unsupported because the manuscript contains no ablation that trains identical 7B models on MoT persona data versus data generated by standard single-prompt or random humor synthesis under the same alignment procedure. Without this isolation, observed gains cannot be attributed to the six-persona Mixture-of-Thought process rather than dataset size, prompt details, or evaluation artifacts.
Authors: We agree that the current manuscript does not include a direct ablation isolating the six-persona Mixture-of-Thought against single-prompt or random synthesis under identical 7B training and alignment conditions. The existing comparisons demonstrate outperformance over larger models and alternative alignments, but they do not fully rule out confounds from dataset construction details. In the revised version we will add this controlled ablation: we will train an additional 7B model on humor data generated via a single generic prompt (keeping dataset size, alignment method such as DPO, and training hyperparameters fixed) and report the performance delta relative to the MoT-trained model. This will be presented in a new subsection of the Results. revision: yes
-
Referee: [Abstract] Abstract and Evaluation section: Performance claims are stated without specifying the concrete metrics (human preference scores, automatic humor metrics, etc.), training and test set sizes, number of evaluation prompts, or any statistical tests (e.g., paired t-tests or bootstrap confidence intervals). This omission makes it impossible to judge whether the reported outperformance over larger baselines is robust or reproducible.
Authors: We acknowledge the need for explicit reporting of all evaluation details. The full manuscript contains human preference scores and automatic metrics in the Evaluation section, but these specifics were not summarized in the Abstract. In the revision we will expand both the Abstract and Evaluation section to state: the primary metric is human win-rate (percentage of times raters prefer our model output), supplemented by automatic humor detection F1 and incongruity scoring; training set size of 48,000 MoT-generated examples; test set of 1,000 held-out prompts; and statistical significance via paired t-tests (p < 0.01) together with 95% bootstrap confidence intervals on the win-rate differences. These additions will make the claims fully reproducible. revision: yes
-
Referee: [Results] Results section: The comparison to 'larger instruction-tuned baselines' and 'state-of-the-art proprietary models' lacks explicit model identifiers, parameter counts, and the precise prompt distribution or task formulation used for head-to-head evaluation, preventing assessment of whether the 7B model truly generalizes or benefits from evaluation-specific artifacts.
Authors: We will revise the Results section to list all baselines with explicit identifiers and sizes (Llama-3-70B-Instruct, Mistral-8x7B-Instruct, GPT-4-Turbo, Claude-3-Opus) and to describe the evaluation protocol in detail: a held-out test distribution of 1,000 prompts balanced across everyday, political, and absurd topics; each prompt elicits a single humorous response; evaluation uses the same human raters and automatic metrics for all models. This will clarify that the 7B model was tested under identical conditions and will allow readers to assess generalization. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's core chain consists of synthesizing a humor dataset via Mixture-of-Thought with six cognitive personas drawn from psychological theories, fine-tuning a 7B model on that data, and reporting empirical performance against larger instruction-tuned baselines and proprietary models. No equation, parameter fit, or self-citation reduces the reported gains to the inputs by construction; the claim that cognitive curation is more critical than scale or alignment is presented as an outcome of those external comparisons rather than a tautology. The methodology remains self-contained against independent benchmarks, producing a normal finding of zero circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Cognitive Synergy Framework... six cognitive personas... Mixture-of-Thought (MoT)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
BBC News . 2026. Africa. https://www.bbc.com/news/world/africa. Accessed: 2026-03-10
work page 2026
-
[4]
Santiago Castro, Luis Chiruzzo, Santiago G \'o ngora, Salar Rahili, Naihao Deng, Ignacio Sastre, Victoria Amoroso, Guillermo Rey, Aiala Ros \'a , Guillermo Moncecchi, J. A. Meaney, Juan Jos \'e Prada, and Rada Mihalcea. 2026. SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate . In Proceedings of the 20th International Workshop ...
work page 2026
-
[5]
Yang Chen, Chong Yang, Tu Hu, Xinhao Chen, Man Lan, Li Cai, Xinlin Zhuang, Xuan Lin, Xin Lu, and Aimin Zhou. 2024. https://doi.org/10.18653/v1/2024.findings-acl.51 Are U a Joke Master ? Pun Generation via Multi - Stage Curriculum Learning towards a Humor LLM . In Findings of the Association for Computational Linguistics : ACL 2024 , pages 878--890, Bangko...
-
[6]
Yuyan Chen, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Bang Liu, and Yunwen Chen. 2023. Can pre-trained language models understand chinese humor? In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 465--480
work page 2023
-
[7]
Ryan Rony Dsilva. 2024. Augmenting large language models with humor theory to understand puns. Master's thesis, Purdue University
work page 2024
-
[8]
Shaun Eli. 2026. Expired comedy (topical humor). https://www.brainchampagne.com/writings/expired-comedy-topical-humor. Accessed: 2026-03-16
work page 2026
- [9]
-
[10]
Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, and Arman Cohan. 2025. Re-evaluating automatic llm system ranking for alignment with human preference. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4605--4629
work page 2025
-
[11]
Soham V Govande, Taeuk Kang, and Andrew Shi. 2025. Teaching models to reason about vision-based code generation using grpo. arXiv preprint
work page 2025
-
[12]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3
work page 2022
- [13]
-
[14]
T. Khurana, K. Pillalamarri, V. Pande, and M. Singh. 2024. https://doi.org/10.48550/arXiv.2408.06335 Lolgorithm: Integrating semantic, syntactic and contextual elements for humor classification . Preprint, arXiv:2408.06335
-
[15]
Sheila Lintott. 2016. Superiority in humor theory. The Journal of Aesthetics and Art Criticism, 74(4):347--358
work page 2016
-
[16]
Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, and Kaiqi Huang. 2025. Sequential preference optimization: Multi-dimensional preference alignment with implicit reward modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27509--27517
work page 2025
-
[17]
A Peter McGraw and Caleb Warren. 2010. Benign violations: Making immoral behavior funny. Psychological science, 21(8):1141--1149
work page 2010
-
[18]
James M Olson and Neal J Roese. 1995. The perceived funniness of humorous stimuli. Personality and Social Psychology Bulletin, 21(9):908--913
work page 1995
-
[19]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741
work page 2023
-
[20]
Greg Robison. 2024. The last laugh: Exploring the role of humor as a benchmark for large language models. https://www.lesswrong.com/posts/2djAwm3B8CdoKZ44s/the-last-laugh-exploring-the-role-of-humor-as-a-benchmark. Accessed: 2026-03-15
work page 2024
-
[21]
Tabea Scheel. 2025. Definitions, theories, and measurement of humor. In Humor at work in teams, leadership, negotiations, learning, and health, pages 11--37. Springer
work page 2025
-
[22]
Mohammadamin Shafiei and Hamidreza Saffari. 2025. https://doi.org/10.48550/arXiv.2506.01819 Not All Jokes Land : Evaluating Large Language Models Understanding of Workplace Humor . arXiv preprint. ArXiv:2506.01819 [cs]
-
[23]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Alexey Tikhonov and Pavel Shtykovskiy. 2024. https://doi.org/10.48550/arXiv.2405.07280 Humor Mechanics : Advancing Humor Generation with Multistep Reasoning . arXiv preprint. ArXiv:2405.07280 [cs]
- [25]
- [26]
-
[27]
Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, and Hui Wang. 2025. https://doi.org/10.48550/arXiv.2410.10370 Innovative Thinking , Infinite Humor : Humor Research of Large Language Models through Structured Thought Leaps . arXiv preprint. ArXiv:2410.10370 [cs]
-
[28]
Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...
work page 2024
-
[29]
Melissa Wanzer, Melanie Booth-Butterfield, and Steven Booth-Butterfield. 1995. The funny people: A source-orientation to the communication of humor. Communication Quarterly, 43(2):142--154
work page 1995
-
[30]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837
work page 2022
-
[31]
Yusuke Yasuda and Tomoki Toda. 2025. Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment. Computer Speech & Language, page 101888
work page 2025
- [32]
-
[33]
Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. 2024. https://openaccess.thecvf.com/content/CVPR2024/html/Zhong_Lets_Think_Outside_the_Box_Exploring_Leap-of-Thought_in_Large_Language_CVPR_2024_paper.html Let's Think Outside the Box : Exploring Leap -of- Thought in Large Language Models with Creative Humo...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.