pith. sign in

arxiv: 2604.09629 · v1 · submitted 2026-03-19 · 💻 cs.CL

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

Pith reviewed 2026-05-15 08:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords humor generationcognitive synergymixture of thoughtpersona distillationlarge language modelsdata curationfine-tuningalignment
0
0 comments X

The pith

Cognitive personas synthesizing humor data let a 7B model match or beat much larger LLMs at comedy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models struggle with humor because their core training objective favors the most probable next word rather than the surprise and incongruity that make jokes work. The paper introduces the Cognitive Synergy Framework, which applies a Mixture-of-Thought process using six distinct cognitive personas to generate diverse, theory-grounded humor examples from any prompt. These examples form a curated dataset used to fine-tune a 7B-parameter model, which then outperforms larger instruction-tuned baselines and reaches parity with leading proprietary systems. The central result is that the quality and cognitive structure of the training data matter more for humor than either model scale or the specific alignment algorithm chosen.

Core claim

The Cognitive Synergy Framework deploys six cognitive personas through a Mixture-of-Thought approach to synthesize a high-quality, diverse humor dataset; fine-tuning a 7B student model on this data produces performance that significantly exceeds larger instruction-tuned models and competes with state-of-the-art proprietary models, establishing that cognitive-driven data curation outweighs both model scale and alignment methods such as DPO or the introduced O-GRPO.

What carries the argument

Mixture-of-Thought (MoT) deployment of six cognitive personas that each generate a distinct comedic perspective on a given prompt to create the training data.

If this is right

  • A 7B model trained on this data can exceed larger instruction-tuned models in humor generation tasks.
  • Cognitive data curation delivers larger gains than switching between alignment algorithms such as DPO and O-GRPO.
  • The same framework can reduce dependence on model scale for tasks that require incongruity.
  • Offline Group Relative Policy Optimization serves as a viable alternative alignment method when paired with the curated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same six-persona structure could be adapted to improve LLM performance on other tasks that depend on surprise, such as creative writing or riddle generation.
  • Psychological theories of humor provide a reusable template for designing data-synthesis pipelines in other subjective domains.
  • The approach implies that targeted data quality can substitute for scale in narrow creative capabilities, which would be testable by applying the method to smaller models on new tasks.
  • Extending the personas to additional humor styles or cultural contexts would likely increase output diversity without requiring larger models.

Load-bearing premise

The humor examples produced by the six cognitive personas through Mixture-of-Thought form a higher-quality and more useful training signal than data created by standard methods.

What would settle it

Training the identical 7B model on the same volume of humor data generated without any cognitive personas and observing no improvement over the baselines would show the personas add no value.

Figures

Figures reproduced from arXiv: 2604.09629 by Edward Ajayi, Prasenjit Mitra.

Figure 1
Figure 1. Figure 1: Example of an LLM-generated joke based on a news headline prompt, synthesized using the Cognitive Synergy Framework. Recent efforts to improve LLM humor generation have focused on logical “thought leaps” (Zhong et al., 2024) or multistep reason￾ing (Wang et al., 2025). While these improve performance in their specific humor generation tasks, they do not guarantee accurate humor generation and often miss th… view at source ↗
Figure 2
Figure 2. Figure 2: The HumorGen training pipeline. (A) Generation: Input headlines are processed by the Cognitive Synergy module (MoT), generating diverse candidates from 6 distinct personas. (B) Collation: Candidates are ranked via a pairwise evaluation system using an LLM judge to compute Elo ratings. (C) SFT: The base policy is fine-tuned on the top-ranked candidates. (D) Alignment: The model is further optimized via two … view at source ↗
Figure 4
Figure 4. Figure 4: Pairwise win-rate heatmap (row beats column %). Appendix A. performance to SFT (1083.9), while O-GRPO (1034.5) is less impressive. Thus, the alignment exercises did not improve the models beyond the gains from high-quality SFT data (Cogni￾tive Synergy Framework). All fine-tuned vari￾ants substantially outperform base Qwen-7B (+427–476 BT points). 6.3 CSD and the Explainer Trap The “explainer trap” emerges … view at source ↗
Figure 3
Figure 3. Figure 3: Bradley-Terry ratings with 95% confi￾dence intervals. HG = HumorGen; -T = Think; Gem2.5 = Gemini-2.5-Pro; Qw = Qwen. Model BT Rating 95% CI Win% GPT-5 1323.7 [1288, 1365] 84.7 Kimi-K2 1221.6 [1188, 1260] 75.3 Gemini-2.5-Pro 1190.3 [1157, 1225] 72.0 HumorGen SFT-7B 1083.9 [1057, 1114] 59.5 HumorGen DPO-7B 1079.9 [1055, 1108] 59.0 HumorGen GRPO-7B 1034.5 [1001, 1064] 53.3 HumorGen SFT-Think-7B 993.2 [965, 10… view at source ↗
Figure 5
Figure 5. Figure 5: Pairwise win-rate heatmap showing head-to-head performance across all evaluated models. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: A demonstration of the Cognitive Synergy Framework. Given the exact same headline, each of [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Think vs. Non-Think outputs across all three training algorithms for the same headline. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: for the screen as displayed). (a) Instructions screen (HumorGen Blind Eval). (b) Sign-in / instructions (alternative view) [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Preliminary Evaluation Interface: Used internally during early experi￾mentation to confirm our core hypothesis regarding Cognitive Synergy. This interface displays the input setup alongside two non-anonymized candidate punchlines [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Blind Human Evaluation Interface: Deployed to our volunteer annotators for unbiased A/B testing. This version strictly anonymizes the model identities and randomly swaps candidate positions to prevent bias [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: HumorRank output for a single prompt showing top-4 (green) and bottom-4 (red) ranked [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Sample HumorGen-Com-7B outputs after fine-tuning on the Shaun Eli corpus. The model adopts the dominant “Why did X. . . ” setup-punchline structure of stand-up comedy—a style optimized for live delivery rather than textual punch—explaining the significant performance regression (BT: 1083.9 → 653.1) [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Zero-shot generations on African news headlines. Both models were prompted without persona [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Representative failure mode examples. Red entries show [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
read the original abstract

Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective - predicting the most likely next word - inherently conflicts with the surprise and incongruity needed for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a theoretically grounded methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework creates a theoretically grounded dataset, which we use to fine-tune a 7B-parameter student model. We compare Direct Preference Optimization (DPO) and a novel Offline Group Relative Policy Optimization (O-GRPO); our 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models. We find that cognitive-driven data curation is far more critical than alignment algorithms or model scale for humor generation. Code and data will be available upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the Cognitive Synergy Framework for humor generation in LLMs. It employs a Mixture-of-Thought (MoT) approach deploying six cognitive personas (e.g., The Absurdist, The Cynic) drawn from psychological humor theories to synthesize diverse training data from prompts. This dataset is used to fine-tune a 7B student model, which is aligned via either Direct Preference Optimization (DPO) or a proposed Offline Group Relative Policy Optimization (O-GRPO). The central claims are that the resulting 7B model significantly outperforms larger instruction-tuned baselines and matches state-of-the-art proprietary models, with the conclusion that cognitive-driven data curation matters more than alignment method or model scale.

Significance. If the performance claims and the primacy of cognitive curation are substantiated by rigorous ablations and transparent evaluation, the work would demonstrate that theory-grounded data synthesis can enable compact models to rival much larger systems on creative tasks. This would shift emphasis in the field toward psychologically motivated curation pipelines rather than scale alone, with potential applicability to other incongruity-driven domains.

major comments (3)
  1. [Abstract] Abstract: The headline claim that 'cognitive-driven data curation is far more critical than alignment algorithms or model scale' is unsupported because the manuscript contains no ablation that trains identical 7B models on MoT persona data versus data generated by standard single-prompt or random humor synthesis under the same alignment procedure. Without this isolation, observed gains cannot be attributed to the six-persona Mixture-of-Thought process rather than dataset size, prompt details, or evaluation artifacts.
  2. [Abstract] Abstract and Evaluation section: Performance claims are stated without specifying the concrete metrics (human preference scores, automatic humor metrics, etc.), training and test set sizes, number of evaluation prompts, or any statistical tests (e.g., paired t-tests or bootstrap confidence intervals). This omission makes it impossible to judge whether the reported outperformance over larger baselines is robust or reproducible.
  3. [Results] Results section: The comparison to 'larger instruction-tuned baselines' and 'state-of-the-art proprietary models' lacks explicit model identifiers, parameter counts, and the precise prompt distribution or task formulation used for head-to-head evaluation, preventing assessment of whether the 7B model truly generalizes or benefits from evaluation-specific artifacts.
minor comments (1)
  1. [Abstract] Abstract: The novel O-GRPO algorithm is named but not briefly characterized (e.g., how the group-relative objective differs from standard DPO), which would help readers immediately grasp its contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional clarity and controls are needed to strengthen the attribution of our results to the Cognitive Synergy Framework. We address each major comment below and commit to revisions that will improve the manuscript's rigor and reproducibility.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that 'cognitive-driven data curation is far more critical than alignment algorithms or model scale' is unsupported because the manuscript contains no ablation that trains identical 7B models on MoT persona data versus data generated by standard single-prompt or random humor synthesis under the same alignment procedure. Without this isolation, observed gains cannot be attributed to the six-persona Mixture-of-Thought process rather than dataset size, prompt details, or evaluation artifacts.

    Authors: We agree that the current manuscript does not include a direct ablation isolating the six-persona Mixture-of-Thought against single-prompt or random synthesis under identical 7B training and alignment conditions. The existing comparisons demonstrate outperformance over larger models and alternative alignments, but they do not fully rule out confounds from dataset construction details. In the revised version we will add this controlled ablation: we will train an additional 7B model on humor data generated via a single generic prompt (keeping dataset size, alignment method such as DPO, and training hyperparameters fixed) and report the performance delta relative to the MoT-trained model. This will be presented in a new subsection of the Results. revision: yes

  2. Referee: [Abstract] Abstract and Evaluation section: Performance claims are stated without specifying the concrete metrics (human preference scores, automatic humor metrics, etc.), training and test set sizes, number of evaluation prompts, or any statistical tests (e.g., paired t-tests or bootstrap confidence intervals). This omission makes it impossible to judge whether the reported outperformance over larger baselines is robust or reproducible.

    Authors: We acknowledge the need for explicit reporting of all evaluation details. The full manuscript contains human preference scores and automatic metrics in the Evaluation section, but these specifics were not summarized in the Abstract. In the revision we will expand both the Abstract and Evaluation section to state: the primary metric is human win-rate (percentage of times raters prefer our model output), supplemented by automatic humor detection F1 and incongruity scoring; training set size of 48,000 MoT-generated examples; test set of 1,000 held-out prompts; and statistical significance via paired t-tests (p < 0.01) together with 95% bootstrap confidence intervals on the win-rate differences. These additions will make the claims fully reproducible. revision: yes

  3. Referee: [Results] Results section: The comparison to 'larger instruction-tuned baselines' and 'state-of-the-art proprietary models' lacks explicit model identifiers, parameter counts, and the precise prompt distribution or task formulation used for head-to-head evaluation, preventing assessment of whether the 7B model truly generalizes or benefits from evaluation-specific artifacts.

    Authors: We will revise the Results section to list all baselines with explicit identifiers and sizes (Llama-3-70B-Instruct, Mistral-8x7B-Instruct, GPT-4-Turbo, Claude-3-Opus) and to describe the evaluation protocol in detail: a held-out test distribution of 1,000 prompts balanced across everyday, political, and absurd topics; each prompt elicits a single humorous response; evaluation uses the same human raters and automatic metrics for all models. This will clarify that the 7B model was tested under identical conditions and will allow readers to assess generalization. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's core chain consists of synthesizing a humor dataset via Mixture-of-Thought with six cognitive personas drawn from psychological theories, fine-tuning a 7B model on that data, and reporting empirical performance against larger instruction-tuned baselines and proprietary models. No equation, parameter fit, or self-citation reduces the reported gains to the inputs by construction; the claim that cognitive curation is more critical than scale or alignment is presented as an outcome of those external comparisons rather than a tautology. The methodology remains self-contained against independent benchmarks, producing a normal finding of zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the framework relies on psychological theories of humor which are assumed as background knowledge.

pith-pipeline@v0.9.0 · 5482 in / 1138 out tokens · 51345 ms · 2026-05-15T08:38:56.673007+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    BBC News . 2026. Africa. https://www.bbc.com/news/world/africa. Accessed: 2026-03-10

  4. [4]

    Santiago Castro, Luis Chiruzzo, Santiago G \'o ngora, Salar Rahili, Naihao Deng, Ignacio Sastre, Victoria Amoroso, Guillermo Rey, Aiala Ros \'a , Guillermo Moncecchi, J. A. Meaney, Juan Jos \'e Prada, and Rada Mihalcea. 2026. SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate . In Proceedings of the 20th International Workshop ...

  5. [5]

    Yang Chen, Chong Yang, Tu Hu, Xinhao Chen, Man Lan, Li Cai, Xinlin Zhuang, Xuan Lin, Xin Lu, and Aimin Zhou. 2024. https://doi.org/10.18653/v1/2024.findings-acl.51 Are U a Joke Master ? Pun Generation via Multi - Stage Curriculum Learning towards a Humor LLM . In Findings of the Association for Computational Linguistics : ACL 2024 , pages 878--890, Bangko...

  6. [6]

    Yuyan Chen, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Bang Liu, and Yunwen Chen. 2023. Can pre-trained language models understand chinese humor? In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 465--480

  7. [7]

    Ryan Rony Dsilva. 2024. Augmenting large language models with humor theory to understand puns. Master's thesis, Purdue University

  8. [8]

    Shaun Eli. 2026. Expired comedy (topical humor). https://www.brainchampagne.com/writings/expired-comedy-topical-humor. Accessed: 2026-03-16

  9. [9]

    Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan, and Viktor Prasanna. 2025. Mixture of thoughts: Learning to aggregate what experts think, not just what they say. arXiv preprint arXiv:2509.21164

  10. [10]

    Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, and Arman Cohan. 2025. Re-evaluating automatic llm system ranking for alignment with human preference. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4605--4629

  11. [11]

    Soham V Govande, Taeuk Kang, and Andrew Shi. 2025. Teaching models to reason about vision-based code generation using grpo. arXiv preprint

  12. [12]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3

  13. [13]

    Sophie Jentzsch and Kristian Kersting. 2023. Chatgpt is fun, but it is not funny! humor is still challenging large language models. arXiv preprint arXiv:2306.04563

  14. [14]

    Khurana, K

    T. Khurana, K. Pillalamarri, V. Pande, and M. Singh. 2024. https://doi.org/10.48550/arXiv.2408.06335 Lolgorithm: Integrating semantic, syntactic and contextual elements for humor classification . Preprint, arXiv:2408.06335

  15. [15]

    Sheila Lintott. 2016. Superiority in humor theory. The Journal of Aesthetics and Art Criticism, 74(4):347--358

  16. [16]

    Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, and Kaiqi Huang. 2025. Sequential preference optimization: Multi-dimensional preference alignment with implicit reward modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27509--27517

  17. [17]

    A Peter McGraw and Caleb Warren. 2010. Benign violations: Making immoral behavior funny. Psychological science, 21(8):1141--1149

  18. [18]

    James M Olson and Neal J Roese. 1995. The perceived funniness of humorous stimuli. Personality and Social Psychology Bulletin, 21(9):908--913

  19. [19]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741

  20. [20]

    Greg Robison. 2024. The last laugh: Exploring the role of humor as a benchmark for large language models. https://www.lesswrong.com/posts/2djAwm3B8CdoKZ44s/the-last-laugh-exploring-the-role-of-humor-as-a-benchmark. Accessed: 2026-03-15

  21. [21]

    Tabea Scheel. 2025. Definitions, theories, and measurement of humor. In Humor at work in teams, leadership, negotiations, learning, and health, pages 11--37. Springer

  22. [22]

    Mohammadamin Shafiei and Hamidreza Saffari. 2025. https://doi.org/10.48550/arXiv.2506.01819 Not All Jokes Land : Evaluating Large Language Models Understanding of Workplace Humor . arXiv preprint. ArXiv:2506.01819 [cs]

  23. [23]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  24. [24]

    Alexey Tikhonov and Pavel Shtykovskiy. 2024. https://doi.org/10.48550/arXiv.2405.07280 Humor Mechanics : Advancing Humor Generation with Multistep Reasoning . arXiv preprint. ArXiv:2405.07280 [cs]

  25. [25]

    Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. 2025. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017

  26. [26]

    Dmitry Vikhorev, Daria Galimzianova, Svetlana Gorovaia, Elizaveta Zhemchuzhina, and Ivan P Yamshchikov. 2024. Cleancomedy: Creating friendly humor through generative techniques. arXiv preprint arXiv:2412.09203

  27. [27]

    Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, and Hui Wang. 2025. https://doi.org/10.48550/arXiv.2410.10370 Innovative Thinking , Infinite Humor : Humor Research of Large Language Models through Structured Thought Leaps . arXiv preprint. ArXiv:2410.10370 [cs]

  28. [28]

    Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...

  29. [29]

    Melissa Wanzer, Melanie Booth-Butterfield, and Steven Booth-Butterfield. 1995. The funny people: A source-orientation to the communication of humor. Communication Quarterly, 43(2):142--154

  30. [30]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  31. [31]

    Yusuke Yasuda and Tomoki Toda. 2025. Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment. Computer Speech & Language, page 101888

  32. [32]

    Hang Zhang, Dayiheng Liu, Jiancheng Lv, and Cheng Luo. 2020. Let's be humorous: Knowledge enhanced humor generation. arXiv preprint arXiv:2004.13317

  33. [33]

    Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. 2024. https://openaccess.thecvf.com/content/CVPR2024/html/Zhong_Lets_Think_Outside_the_Box_Exploring_Leap-of-Thought_in_Large_Language_CVPR_2024_paper.html Let's Think Outside the Box : Exploring Leap -of- Thought in Large Language Models with Creative Humo...