pith. machine review for the scientific record.

arxiv: 2605.10853 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 Lean theorem links

Grounded Satirical Generation with RAG

Linyao Du, Ona de Gibert, Oona Itkonen, Yuxin Su

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords satire generation · retrieval augmented generation · humor generation · LLM evaluation · Finnish language · dictionary definitions · political relevance

The pith

Retrieval from news makes generated satirical definitions more political than humorous.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a pipeline that retrieves from current news to ground the creation of satirical dictionary definitions in Finnish. Human annotators rate outputs from multiple experimental setups and find the results consistently more political than humorous. Topic-based word selection and the addition of retrieval both raise political relevance scores, but produce no reliable lift in humor ratings. Large language models used as judges match the human ratings on political relevance yet diverge sharply on humor. The authors release the annotated data and code to allow others to test grounded satire methods further.

Core claim

A retrieval-augmented pipeline over current news produces satirical dictionary definitions that six human annotators rate higher on political relevance than on humor. Topic-based word choice and the retrieval step each increase political relevance, but neither step improves perceived humor. Large language models align with human political ratings yet perform poorly when asked to judge humor.
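
The page does not spell out how the topic-based word choice is implemented. As a point of reference only, a topic-model-driven selection step over news headlines might look like the sketch below; BERTopic is one common tool for this kind of step, and every name and parameter here is an assumption for illustration, not the authors' code.

```python
# Hypothetical sketch: topic-based source-word selection over news headlines.
# BERTopic is one common choice for this step; the paper's actual stack and
# parameters are not specified on this page.
from bertopic import BERTopic

def select_source_words(headlines: list[str], top_n: int = 10) -> list[str]:
    """Fit a topic model on news headlines and return salient topic words
    as candidate source words for satirical definitions."""
    topic_model = BERTopic(language="multilingual")  # Finnish input
    topics, _ = topic_model.fit_transform(headlines)

    words = []
    for topic_id in set(topics):
        if topic_id == -1:          # -1 is BERTopic's outlier topic
            continue
        # get_topic returns (word, c-TF-IDF weight) pairs
        words.extend(w for w, _ in topic_model.get_topic(topic_id)[:top_n])
    return words
```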

What carries the argument

Retrieval-augmented generation over current news to ground satirical dictionary definitions, evaluated through a task-specific human annotation framework measuring political relevance and humor across conditions.
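
The section above names the machinery without implementation detail. A minimal sketch of what a retrieve-then-generate loop for grounded satirical definitions could look like follows; the lexical retriever, the prompt wording, and the commented `some_llm.generate` call are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch of a retrieve-then-generate loop for grounded satirical
# definitions. Retriever, prompt, and generation backend are all assumptions;
# the paper's actual pipeline may differ in every detail.
from dataclasses import dataclass

@dataclass
class NewsDoc:
    title: str
    body: str

def retrieve(word: str, index: list[NewsDoc], k: int = 3) -> list[NewsDoc]:
    """Placeholder lexical retriever: rank news docs by term overlap with
    the source word. A real pipeline would use BM25 or dense retrieval."""
    return sorted(index, key=lambda d: d.body.lower().count(word.lower()),
                  reverse=True)[:k]

def build_prompt(word: str, docs: list[NewsDoc]) -> str:
    """Ground the generation prompt in the retrieved news snippets."""
    context = "\n".join(f"- {d.title}: {d.body[:200]}" for d in docs)
    return (
        f"Recent Finnish news:\n{context}\n\n"
        f"Write a short satirical dictionary definition of '{word}' "
        f"that plays on the news above."
    )

# Usage (hypothetical LLM backend):
# definition = some_llm.generate(build_prompt("sote", retrieve("sote", index)))
```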

If this is right

  • Human annotators perceive the generated definitions as more political than humorous across tested conditions.
  • Topic-based word selection and retrieval-augmented generation each raise the political relevance of the outputs.
  • Neither technique produces clear gains in humor ratings.
  • Large language models match human judgments on political relevance but not on humor.
  • The released dataset supports further experiments on grounded satire generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap between political relevance and humor suggests that current grounding techniques supply context but leave the creative, subjective layer of satire under-served.
  • The same pipeline could be tested in other languages to check whether the political-over-humor pattern is language-specific or more general.
  • Better humor evaluation might require models trained on larger collections of annotated satirical examples rather than zero-shot prompting.

Load-bearing premise

Ratings of political relevance and humor by six human annotators provide a stable and generalizable measure of satire quality in the Finnish context.

What would settle it

A follow-up annotation study with a larger pool of Finnish speakers rating the same definitions for humor and political content, or direct comparison against published satirical definitions.

Figures

Figures reproduced from arXiv: 2605.10853 by Linyao Du, Ona de Gibert, Oona Itkonen, Yuxin Su.

Figure 1: Overview of our generation pipeline for satirical dictionary definitions with RAG.
Figure 2: Distribution of absolute quality scores across annotators.
Figure 3: Average scores across experimental conditions.
Figure 4: Correlation of human scores with Aya-Expanse-8B annotations.
Figure 5: Correlation of human scores with EuroLLM-9B-Instruct annotations.
Figure 6: Correlation of human scores with Llama-3.1-8B-Instruct annotations.
Figure 7: Correlation of human scores with Mistral-7B-Instruct annotations.
Figure 8: Correlation of human scores with Qwen2.5-7B-Instruct annotations.
Original abstract

Humor generation remains a challenging task for Large Language Models (LLMs) due to its subjective nature. We focus on satire, a form of humor strongly shaped by context. In this work, we present a novel pipeline for grounded satire generation that uses Retrieval-Augmented Generation (RAG) over current news to produce satirical dictionary definitions in the Finnish context. We also introduce a new task-specific evaluation framework and annotate 100 generated definitions with six human annotators, enabling analysis across multiple experimental conditions, including cultural background, source-word type, and the presence or absence of RAG. Our results show that the generated definitions are perceived as more political than humorous. Both topic-based word selection and RAG improve the political relevance of the outputs, but neither yields clear gains in humor generation. In addition, our LLM-as-a-judge evaluation of five state-of-the-art models indicates that LLMs correlate well with human judgments on political relevance, but perform poorly on humor. We release our code and annotated dataset to support further research on grounded satire generation and evaluation.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a RAG-based pipeline that retrieves current news to ground the generation of satirical Finnish dictionary definitions. It introduces a task-specific evaluation framework, has six human annotators rate 100 generated definitions across conditions (RAG vs. no RAG, topic-based vs. random word selection, cultural background), and compares the human ratings with those of five LLMs used as judges. The central empirical claims are that the definitions are perceived as more political than humorous, that both topic-based word selection and RAG increase political relevance without clear gains in humor, and that LLM judges correlate well with humans on political relevance but poorly on humor. The code and annotated dataset are released.

Significance. If the human evaluation is shown to be reliable, the work offers a concrete, grounded approach to satire generation in a non-English setting and supplies a public dataset plus evaluation framework that could support follow-on research on context-dependent humor. The release of code and annotations is a clear positive that increases the paper's utility to the community.

major comments (2)
  1. [Human Evaluation] Human Evaluation section (and abstract): the central claims rest on ratings from only six annotators on 100 definitions, yet no inter-annotator agreement statistic (Fleiss' kappa, Krippendorff's alpha, or similar), no annotation guidelines, no breakdown by annotator background or cultural expertise, and no statistical significance tests (paired t-tests, ANOVA, or effect sizes) for the reported differences between conditions are provided. Without these, it is impossible to determine whether the headline finding—that definitions are more political than humorous and that RAG improves political relevance but not humor—reflects systematic effects or annotator idiosyncrasy.
  2. [Results / LLM-as-a-Judge] Results and LLM-as-a-Judge subsection: the statement that LLMs 'correlate well with human judgments on political relevance, but perform poorly on humor' is given without any reported correlation coefficients, confusion matrices, or per-model breakdowns. This leaves the comparative evaluation of the five state-of-the-art models unsupported and prevents readers from assessing how well the LLM judges actually track the human signal that underpins the paper's main conclusions.
minor comments (2)
  1. [Abstract] The abstract refers to 'a new task-specific evaluation framework' without a one-sentence description of its dimensions or scoring procedure; a brief characterization would help readers immediately understand the annotation protocol.
  2. [Results] A figure or table presenting the per-condition means and standard deviations for the political and humor ratings would make the quantitative claims easier to inspect and would strengthen the results section; a minimal sketch of such a summary follows this list.
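
A minimal sketch of the kind of per-condition summary the second minor comment asks for, assuming the ratings sit in a long-format table; the column names and toy values here are hypothetical, not the paper's data.

```python
# Assumed long-format ratings table; columns and values are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "condition": ["rag", "rag", "no_rag", "no_rag"],
    "dimension": ["political", "humor", "political", "humor"],
    "score":     [4, 2, 3, 2],
})

# Per-condition, per-dimension mean and standard deviation.
summary = (ratings.groupby(["condition", "dimension"])["score"]
                  .agg(["mean", "std"]))
print(summary)
```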

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We believe the suggested additions will improve the clarity and rigor of our evaluation sections. We address each major comment below.

Point-by-point responses
  1. Referee: [Human Evaluation] Human Evaluation section (and abstract): the central claims rest on ratings from only six annotators on 100 definitions, yet no inter-annotator agreement statistic (Fleiss' kappa, Krippendorff's alpha, or similar), no annotation guidelines, no breakdown by annotator background or cultural expertise, and no statistical significance tests (paired t-tests, ANOVA, or effect sizes) for the reported differences between conditions are provided. Without these, it is impossible to determine whether the headline finding—that definitions are more political than humorous and that RAG improves political relevance but not humor—reflects systematic effects or annotator idiosyncrasy.

    Authors: We thank the referee for highlighting these important aspects of reporting for human evaluations. Upon reflection, we agree that providing inter-annotator agreement, annotation guidelines, annotator details, and statistical tests will enhance the manuscript. In the revised version, we will add: (1) Krippendorff's alpha for agreement on both political relevance and humor ratings; (2) the full annotation guidelines in an appendix; (3) a table or description of annotator backgrounds, including their cultural expertise in the Finnish context; and (4) results of paired t-tests or Wilcoxon tests with effect sizes (Cohen's d) for comparisons between conditions (RAG vs. no RAG, topic-based vs. random). We note that the six annotators were all Finnish natives, which supports the cultural grounding, but we will make this explicit. These changes will allow readers to better evaluate the reliability of our findings; a sketch of the agreement and significance computations appears after this list. revision: yes

  2. Referee: [Results / LLM-as-a-Judge] Results and LLM-as-a-Judge subsection: the statement that LLMs 'correlate well with human judgments on political relevance, but perform poorly on humor' is given without any reported correlation coefficients, confusion matrices, or per-model breakdowns. This leaves the comparative evaluation of the five state-of-the-art models unsupported and prevents readers from assessing how well the LLM judges actually track the human signal that underpins the paper's main conclusions.

    Authors: We agree that the LLM-as-a-judge results require more quantitative support to be fully convincing. We will revise the subsection to include Spearman rank correlation coefficients (or Pearson, depending on the data) between the human ratings and each LLM's ratings for political relevance and humor separately. Additionally, we will provide per-model performance breakdowns and, where suitable, confusion matrices or agreement tables. This will substantiate our claim that LLMs align better with humans on political relevance than on humor and allow for a more transparent comparison of the five models; a sketch of the per-model correlation reporting appears after this list. revision: yes
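
A minimal sketch of the agreement and significance computations promised in response 1, assuming the `krippendorff` package and SciPy; the toy matrices are invented placeholders standing in for the real annotation data.

```python
# Sketch of the analyses the rebuttal promises: Krippendorff's alpha,
# a Wilcoxon signed-rank test, and a paired-samples Cohen's d.
# All values below are toy data, not the paper's ratings.
import numpy as np
import krippendorff
from scipy.stats import wilcoxon

# Annotators x items matrix of ordinal humor ratings (toy values).
humor = np.array([
    [2, 3, 1, 4, 3],
    [2, 2, 1, 5, 3],
    [3, 3, 2, 4, 2],
])
alpha = krippendorff.alpha(reliability_data=humor,
                           level_of_measurement="ordinal")

# Paired per-item mean political scores, RAG vs. no-RAG (toy values).
rag    = np.array([3.5, 4.0, 3.0, 4.5, 3.5])
no_rag = np.array([3.0, 3.5, 2.5, 3.5, 3.0])
stat, p = wilcoxon(rag, no_rag)

diff = rag - no_rag
cohens_d = diff.mean() / diff.std(ddof=1)   # paired-samples effect size
print(f"alpha={alpha:.2f}, Wilcoxon p={p:.3f}, d={cohens_d:.2f}")
```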
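And a matching sketch of the per-model correlation reporting promised in response 2; the model names come from the figure captions above, while the scores are invented placeholders.

```python
# Per-model Spearman correlation between human and LLM-judge scores.
# Model names are from the figure captions; all scores are toy values.
from scipy.stats import spearmanr

human = [3, 4, 2, 5, 1, 4]          # mean human political-relevance scores
judges = {
    "Aya-Expanse-8B":        [3, 4, 2, 4, 2, 4],
    "EuroLLM-9B-Instruct":   [2, 4, 3, 5, 1, 3],
    "Llama-3.1-8B-Instruct": [3, 3, 2, 5, 1, 5],
    "Mistral-7B-Instruct":   [4, 3, 2, 4, 2, 4],
    "Qwen2.5-7B-Instruct":   [3, 4, 1, 5, 2, 4],
}
for name, scores in judges.items():
    rho, p = spearmanr(human, scores)
    print(f"{name}: Spearman rho={rho:.2f} (p={p:.3f})")
```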

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with external human/LLM judgments

Full rationale

The paper presents a RAG-based generation pipeline for Finnish satirical definitions and evaluates outputs via direct human annotation of 100 items by six annotators plus LLM-as-judge comparisons. There are no mathematical derivations, equations, fitted parameters, or first-principles predictions that could reduce to the inputs by construction. All headline results (political vs. humorous perception, effects of topic selection and RAG) come from external ratings rather than self-referential logic, self-citations, or renamed empirical patterns; the study's conclusions rest directly on its own annotation data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an applied empirical study in NLP that relies on standard RAG components and human evaluation protocols without introducing new free parameters, axioms, or invented entities beyond the experimental setup.

pith-pipeline@v0.9.0 · 5485 in / 1158 out tokens · 86104 ms · 2026-05-12T03:55:17.116141+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
