pith. machine review for the scientific record.

arxiv: 2604.17621 · v1 · submitted 2026-04-19 · 💻 cs.AI

Recognition: unknown

KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM evaluation · knowledge coverage · compositional reasoning · benchmarking · universe enumeration · tip of the iceberg · open source models · knowledge grounded reasoning

The pith

Current LLMs show severe limits in covering full knowledge universes and composing set-based reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

KnowledgeBerg is introduced as a benchmark to assess LLMs' ability to handle deceptively simple questions that require systematic coverage of a bounded knowledge universe and compositional reasoning over it. Representative open-source models score only 5.26 to 36.88 F1 on enumeration and 16.00 to 44.19 accuracy on reasoning tasks across 10 domains and 17 languages. The analysis identifies three failure stages—completeness, awareness, and application—that persist even when using test-time compute or retrieval augmentation. This exposes how current LLMs fall short in organizing structured knowledge for bounded domains.

Core claim

The paper formalizes the "tip of the iceberg" challenge along two dimensions: knowledge width, the size of the required knowledge universe, and reasoning depth, the number of set operations needed. It presents KnowledgeBerg, 4,800 questions derived from 1,183 seeds, and shows that LLMs fail both at complete enumeration and at correct application of reasoning, in three stages that hold across languages and model scales, with only limited improvement from augmentations.
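To make the two dimensions concrete, here is a minimal sketch in Python. The universe, the founding-member and border sets, and the example question are illustrative assumptions, not items from the benchmark; only the width/depth definitions are taken from the paper.

```python
# Toy illustration of the paper's two dimensions:
#   knowledge width  = |Omega|, the cardinality of the required knowledge universe
#   reasoning depth  = the number of compositional set operations over that universe
# The universe and question below are invented for illustration only.

# Hypothetical bounded universe: countries using the euro.
universe = {
    "Austria", "Belgium", "Croatia", "Cyprus", "Estonia", "Finland", "France",
    "Germany", "Greece", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg",
    "Malta", "Netherlands", "Portugal", "Slovakia", "Slovenia", "Spain",
}
knowledge_width = len(universe)  # the full universe must be covered to answer safely

# "Which euro countries are not founding EU members but border Germany?"
# looks simple, yet it composes two set operations over the whole universe.
founding_members = {"Belgium", "France", "Germany", "Italy", "Luxembourg", "Netherlands"}
borders_germany = {"Austria", "Belgium", "France", "Luxembourg", "Netherlands"}

answer = (universe - founding_members) & borders_germany  # difference, then intersection
reasoning_depth = 2

print(knowledge_width, reasoning_depth, sorted(answer))  # 20 2 ['Austria']
```

The small final answer sitting on top of a twenty-element universe is exactly the iceberg shape the paper names.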

What carries the argument

KnowledgeBerg benchmark of multiple-choice questions derived from enumeration seeds grounded in authoritative sources, designed to probe failures in knowledge completeness, awareness of requirements, and application of reasoning.

Load-bearing premise

The enumeration seeds and questions accurately reflect real-world requirements for systematic knowledge coverage and compositional reasoning without significant design biases or incomplete sourcing.

What would settle it

If a future LLM achieves F1 scores above 80 on universe enumeration and accuracy above 80 on the reasoning questions while maintaining consistency across domains, the claim of severe limitations would be challenged.
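The threshold above presupposes a score on universe enumeration. The paper's exact scoring protocol is not restated in this summary, so the following is a minimal set-based F1 sketch; the whitespace-stripping, lower-casing, and exact-match assumptions may differ from the benchmark's actual normalization rules.

```python
def universe_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between a model's enumerated universe and the gold universe.

    Assumes exact string matching after stripping and lower-casing; the
    benchmark's real alias handling and normalization may differ.
    """
    predicted = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in gold}
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# A model that names half of an eight-element universe with no spurious items:
print(universe_f1(
    {"Mercury", "Venus", "Earth", "Mars"},
    {"Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"},
))  # ~0.667: perfect precision, 50% recall
```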

Figures

Figures reproduced from arXiv: 2604.17621 by Johan Bos, Qianru Meng, Xiao Zhang, Yongjian Chen, Yumeng Wang.

Figure 1. Illustration of the tip-of-the-iceberg phenomenon.

Figure 2. Iceberg Gap across benchmarks (mean ± 95% CI), estimated from N = 500 items sampled uniformly at random per benchmark. … set Ω by mapping each raw value to its pooled percentile (mid-rank for ties), yielding scores in (0, 1) (Appendix B). Within-benchmark normalization is intentionally avoided: a raw width of 10 may be extreme in a narrow benchmark but typical in a broader one, so benchmark-relative scalin…

Figure 3. KRQ accuracy versus enumeration quality (Universe F1), with instances binned by F1 and aggregated across models. Instance-level correlations are near-zero (Spearman ρ = 0.0023; Kendall τ = 0.0020). More importantly, the two metrics are not aligned. Some models with relatively strong enumeration completeness perform poorly on KRQs; for example, Mistral-Small-24B reaches 33.26 Universe F1 but only 19.88 K…

Figure 5. Effect of diagnostic prompt variants on KRQ …
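Figure 3 reports near-zero instance-level rank correlations between enumeration quality and KRQ accuracy. A minimal sketch of how such rank correlations are computed with SciPy, on fabricated instance-level pairs; the paper additionally bins instances by F1 before aggregating across models, which this sketch does not reproduce.

```python
from scipy.stats import kendalltau, spearmanr

# Fabricated instance-level pairs: Universe F1 of the enumeration for a seed,
# and whether the associated knowledge-grounded question was answered correctly.
universe_f1 = [0.05, 0.12, 0.33, 0.28, 0.41, 0.09, 0.22, 0.37]
krq_correct = [0, 1, 0, 1, 0, 0, 1, 0]

rho, rho_p = spearmanr(universe_f1, krq_correct)
tau, tau_p = kendalltau(universe_f1, krq_correct)
print(f"Spearman rho = {rho:+.3f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:+.3f} (p = {tau_p:.3f})")
```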
Original abstract

Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg
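The abstract links to the released dataset on Hugging Face. A minimal sketch of pulling it down with the `datasets` library; the dataset id comes from the abstract, but the splits and columns it exposes are not stated here, so the code only inspects whatever it finds.

```python
# pip install datasets
from datasets import load_dataset

# Dataset id taken verbatim from the abstract; configuration and split names
# are not specified in this summary, so we load the default and inspect it.
ds = load_dataset("2npc/KnowledgeBerg")

print(ds)  # available splits, column names, and row counts
first_split = next(iter(ds))
print(ds[first_split][0])  # one raw record: seed, question, options, etc. (schema unverified)
```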

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, designed to test LLMs on systematic coverage of bounded knowledge universes (knowledge width) and compositional set-based reasoning (reasoning depth), termed the 'tip of the iceberg' phenomenon. It reports that representative open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning, with diagnostic analyses identifying three failure stages (completeness, awareness, application) that persist across languages and scales; test-time compute and RAG yield modest gains (up to 4.35 and 3.78 points) but gaps remain. The dataset is publicly released on Hugging Face.

Significance. If the benchmark seeds and questions validly require exhaustive universe coverage plus compositional reasoning without permitting partial-knowledge shortcuts or containing selection/grounding artifacts, the results would highlight important architectural limitations in how LLMs organize structured knowledge and perform set operations over bounded domains. The multilingual coverage, public dataset release, and explicit formalization of width/depth dimensions would strengthen the contribution to LLM evaluation research.

major comments (3)
  1. [Benchmark Construction] Benchmark Construction section: The claim that universes are 'grounded in authoritative sources to ensure reproducibility' is load-bearing for interpreting the low F1/accuracy scores as evidence of 'severe limitations' in LLMs, yet the manuscript provides no explicit protocol for exhaustive enumeration of the 1,183 seeds, coverage verification, or controls against incomplete domains and linguistic cues that might allow partial-knowledge solutions. This directly affects whether the three failure stages and performance gaps support the central interpretation.
  2. [Diagnostic Analyses] Diagnostic Analyses section: The distinction among the three failure stages (completeness, awareness, application) is presented as a key finding but lacks a clear operationalization or quantitative criteria for classifying model outputs into these categories; without inter-annotator agreement, example traces, or metrics showing how stages are separated, the diagnostic claims cannot be fully assessed.
  3. [Experimental Results] Experimental Results section: The reported gains from test-time compute (up to 4.35 points) and RAG (up to 3.78 points) are used to argue that 'substantial gaps remain,' but without details on exact prompting methods, statistical significance testing, or comparison to stronger baselines, it is unclear whether these improvements are robust or if the headline limitations are overstated.
minor comments (2)
  1. [Abstract] Abstract: The terms 'knowledge width' and 'reasoning depth' are introduced without a brief inline definition, which may reduce accessibility for readers encountering the formalization for the first time.
  2. [Dataset Description] The manuscript would benefit from an explicit statement of the number of questions per domain/language split and any filtering criteria applied to the 4,800 MCQs to ensure balance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification. We will revise the manuscript to strengthen the presentation of benchmark construction, diagnostic analyses, and experimental details. Our point-by-point responses follow, with planned revisions indicated.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: The claim that universes are 'grounded in authoritative sources to ensure reproducibility' is load-bearing for interpreting the low F1/accuracy scores as evidence of 'severe limitations' in LLMs, yet the manuscript provides no explicit protocol for exhaustive enumeration of the 1,183 seeds, coverage verification, or controls against incomplete domains and linguistic cues that might allow partial-knowledge solutions. This directly affects whether the three failure stages and performance gaps support the central interpretation.

    Authors: We agree that explicit details on the enumeration protocol are needed to support the grounding claim. In the revised manuscript, we will expand the Benchmark Construction section with a new subsection describing: (1) the authoritative sources for each domain (e.g., official government databases for demographics, comprehensive encyclopedias for geography and history); (2) the multi-source cross-verification process and manual expert review to confirm exhaustiveness of the 1,183 seeds; and (3) controls such as option balancing and paraphrasing to reduce linguistic shortcuts. These additions will allow readers to evaluate whether partial-knowledge solutions are feasible and will reinforce that the observed failures reflect genuine limitations in coverage and reasoning. revision: yes

  2. Referee: [Diagnostic Analyses] Diagnostic Analyses section: The distinction among the three failure stages (completeness, awareness, application) is presented as a key finding but lacks a clear operationalization or quantitative criteria for classifying model outputs into these categories; without inter-annotator agreement, example traces, or metrics showing how stages are separated, the diagnostic claims cannot be fully assessed.

    Authors: We acknowledge the need for clearer operationalization. In the revision, we will add formal definitions in the Diagnostic Analyses section: completeness as omission of required universe elements in enumeration tasks; awareness as failure to recognize the need for exhaustive coverage despite explicit prompts; and application as errors in executing set operations (e.g., union, intersection) even when elements are known. We will include 5-6 representative model output traces per stage, along with inter-annotator agreement metrics (Cohen's kappa) from two annotators on a sample of 200 responses. This will provide quantitative grounding for the stage distinctions and their persistence across languages and scales. revision: yes

  3. Referee: [Experimental Results] Experimental Results section: The reported gains from test-time compute (up to 4.35 points) and RAG (up to 3.78 points) are used to argue that 'substantial gaps remain,' but without details on exact prompting methods, statistical significance testing, or comparison to stronger baselines, it is unclear whether these improvements are robust or if the headline limitations are overstated.

    Authors: We agree that additional methodological transparency is required. In the revised Experimental Results section, we will specify the exact test-time compute prompting (chain-of-thought with explicit enumeration instructions and self-consistency sampling) and RAG implementation (domain-specific retrieval from authoritative corpora with top-k integration). We will report statistical significance via paired t-tests with p-values against base models and add comparisons to stronger baselines including 5-shot prompting with domain exemplars. These details will show that the modest gains (4.35 and 3.78 points) are statistically reliable yet insufficient to close the gaps, supporting our interpretation of architectural limitations. revision: yes
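The second response proposes reporting inter-annotator agreement with Cohen's kappa on a 200-response sample. A minimal sketch of that computation with scikit-learn; the three stage labels follow the paper, but the annotations themselves are fabricated.

```python
from sklearn.metrics import cohen_kappa_score

stages = ["completeness", "awareness", "application"]

# Two annotators independently assign a failure stage to each sampled response
# (fabricated labels; the proposed study would use ~200 responses).
annotator_a = ["completeness", "awareness", "application", "completeness", "awareness", "application"]
annotator_b = ["completeness", "awareness", "awareness", "completeness", "awareness", "application"]

kappa = cohen_kappa_score(annotator_a, annotator_b, labels=stages)
print(f"Cohen's kappa = {kappa:.2f}")  # chance-corrected agreement in [-1, 1]
```

The third response commits to paired significance tests of the test-time-compute and RAG gains against base models. A minimal sketch with SciPy's paired t-test on fabricated per-seed accuracies.

```python
from scipy.stats import ttest_rel

# Fabricated per-seed accuracies for a base model and its RAG-augmented variant.
base = [0.18, 0.22, 0.31, 0.15, 0.27, 0.20, 0.24, 0.19]
augmented = [0.21, 0.24, 0.33, 0.18, 0.29, 0.22, 0.28, 0.21]

t_stat, p_value = ttest_rel(augmented, base)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small but consistent gains can still reach significance
```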

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on fixed benchmark seeds

Full rationale

The paper constructs KnowledgeBerg as a fixed set of 1,183 enumeration seeds and 4,800 derived MCQs grounded in external authoritative sources, then reports direct performance measurements (F1 and accuracy) on open-source LLMs. No equations, fitted parameters, predictions derived from model fits, or self-citations appear in the derivation chain. The three failure stages are post-hoc diagnostic interpretations of the observed scores rather than inputs that define the results. The evaluation is self-contained against the released dataset and does not reduce any claim to a renaming, ansatz, or self-referential uniqueness theorem.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Benchmark relies on author-selected domains, languages, and 1,183 enumeration seeds as scope parameters; assumes authoritative sources yield complete, unbiased universes for reproducibility.

free parameters (2)
  • Number of domains = 10
    Selected 10 domains to span coverage; choice affects measured width.
  • Number of languages = 17
    Chose 17 languages for multilingual testing; selection impacts generalizability claims.
axioms (1)
  • domain assumption Universes grounded in authoritative sources ensure reproducibility and validity of knowledge coverage tests
    Invoked to justify seed selection and question grounding without further validation in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1258 out tokens · 39919 ms · 2026-05-10T05:34:11.384982+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 25 canonical work pages · 2 internal anchors
