pith. machine review for the scientific record.

arxiv: 2605.00072 · v1 · submitted 2026-04-30 · 💻 cs.CR · cs.AI


XekRung Technical Report


Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords large language model · cybersecurity · data synthesis · continued pre-training · supervised fine-tuning · reinforcement learning · model evaluation

The pith

XekRung is a large language model built for cybersecurity that reaches top scores on security benchmarks while holding strong general performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present XekRung as a frontier-scale model specialized in cybersecurity. They construct it by first creating large volumes of domain-specific training data through custom synthesis pipelines, then applying a staged training process of continued pre-training, supervised fine-tuning, and reinforcement learning. A multi-dimensional evaluation system tracks progress across both security tasks and ordinary capabilities. If the approach works, it demonstrates that targeted data generation and phased training can produce models that outperform general-purpose ones inside a narrow but high-stakes domain without losing broad usefulness.

Core claim

XekRung achieves state-of-the-art performance on cybersecurity-specific benchmarks among models of the same scale while maintaining strong performance on general benchmarks. It does so through diverse data synthesis pipelines that construct scalable, high-quality training data and a complete training pipeline spanning continued pre-training, supervised fine-tuning, and reinforcement learning, all guided by a multi-dimensional evaluation system for iterative improvement.

What carries the argument

The combination of cybersecurity-tailored data synthesis pipelines and the multi-dimensional evaluation system, which together supply the data foundation and the feedback loop for extending both domain and general abilities.

Load-bearing premise

The custom data synthesis pipelines generate training data that reflects genuine, representative cybersecurity knowledge rather than patterns that merely match the chosen evaluation benchmarks.

What would settle it

A sharp drop in accuracy on a fresh collection of cybersecurity problems assembled independently after training and containing no overlap with the synthesized data patterns.
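
Such a test presupposes a way to verify "no overlap with the synthesized data patterns." One common operationalization, sketched here as an illustration rather than anything the report describes, is a word n-gram overlap check between the training corpus and each fresh evaluation item; the example strings are invented:

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lower-cased word n-grams, a common unit for contamination checks."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs: Iterable[str], eval_items: Iterable[str], n: int = 8) -> float:
    """Fraction of eval items sharing at least one n-gram with the training corpus."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    items = list(eval_items)
    if not items:
        return 0.0
    hits = sum(1 for item in items if ngrams(item, n) & train_grams)
    return hits / len(items)

# Toy example: the first eval item repeats a training phrase verbatim, the second is fresh.
train = ["the attacker escalates privileges via a misconfigured sudoers entry on the host"]
evals = [
    "explain how the attacker escalates privileges via a misconfigured sudoers entry on the host",
    "describe a cross-site scripting payload that bypasses a naive sanitizer filter today",
]
print(contamination_rate(train, evals, n=8))  # → 0.5
```

A benchmark assembled after training with a contamination rate near zero would make the proposed accuracy comparison meaningful; a high rate would void it.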

Figures

Figures reproduced from arXiv: 2605.00072 by Bingyu Zhu, Chengwei Dai, Dongjie Zhang, Jie Liang, Jing Wang, Jin Xu, Jiutian Zeng, Junjie Li, Kaiwen Lv Kacuila, Libin Dong, Longtao Huang, Yang Ge, Yiliang Zhang, Yuanda Wang, Zhaoyu Hu, Ziang Weng.

Figure 1. Overview of XekRung’s performance across 15 cybersecurity benchmarks compared with [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2. Overview of the post-training pipeline. The framework comprises two complementary [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Original abstract

We present XekRung, a frontier large language model for cybersecurity, designed to provide comprehensive security capabilities. To achieve this, we develop diverse data synthesis pipelines tailored to the cybersecurity domain, enabling the scalable construction of high-quality training data and providing a strong foundation for cybersecurity knowledge and understanding. Building on this foundation, we establish a complete training pipeline spanning continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) to further extend the model's capabilities. We further introduce a multi-dimensional evaluation system to guide the iterative improvement of both domain-specific and general-purpose abilities. Extensive experiments demonstrate that XekRung achieves state-of-the-art performance on cybersecurity-specific benchmarks among models of the same scale, while maintaining strong performance on general benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents XekRung, a frontier-scale LLM specialized for cybersecurity. It describes domain-tailored data synthesis pipelines for scalable high-quality training data, a staged training pipeline (continued pre-training, supervised fine-tuning, and reinforcement learning), and a custom multi-dimensional evaluation system intended to balance domain-specific and general capabilities. The central empirical claim is that XekRung attains state-of-the-art results on cybersecurity benchmarks among models of comparable scale while retaining strong performance on general-purpose benchmarks.

Significance. If the performance claims were substantiated with transparent, reproducible benchmarks and baselines, the work would be significant as a demonstration of effective domain adaptation via synthetic data and staged training for security applications. The described pipeline and evaluation framework could inform future domain-specific LLM development, but the current lack of verifiable experimental details prevents assessment of whether genuine capability gains or benchmark-specific optimizations are achieved.

major comments (2)
  1. Abstract: The claim that 'XekRung achieves state-of-the-art performance on cybersecurity-specific benchmarks among models of the same scale' is unsupported by any benchmark names, numerical scores, baseline comparisons, error bars, or methodological details. This absence directly undermines the central empirical contribution, as the SOTA assertion cannot be evaluated or reproduced from the presented material.
  2. The multi-dimensional evaluation system and data synthesis pipelines are described as custom and self-developed, yet no independent external benchmarks or ablation studies are referenced to demonstrate that reported gains are not reducible to metrics defined or influenced by the training process itself. This creates a circularity risk for the performance claims that requires explicit external validation.
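
The missing error bars in the first comment are cheap to supply: a percentile bootstrap over per-item correctness gives a confidence interval for any benchmark accuracy. The benchmark size and scores below are invented for illustration only:

```python
import random

def bootstrap_ci(correct: list, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for mean accuracy over per-item 0/1 scores."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(correct)
    means = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)

# Invented per-item results for a hypothetical 50-question security benchmark.
scores = [1] * 38 + [0] * 12  # 76% raw accuracy
acc, (lo, hi) = bootstrap_ci(scores)
print(f"accuracy {acc:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Reporting intervals of this kind alongside baseline scores is what would let the SOTA claim be evaluated rather than asserted.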

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the transparency and verifiability of our empirical claims. We address each major comment below and will incorporate the suggested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: Abstract: The claim that 'XekRung achieves state-of-the-art performance on cybersecurity-specific benchmarks among models of the same scale' is unsupported by any benchmark names, numerical scores, baseline comparisons, error bars, or methodological details. This absence directly undermines the central empirical contribution, as the SOTA assertion cannot be evaluated or reproduced from the presented material.

    Authors: We agree that the abstract would benefit from greater specificity to allow immediate evaluation of the central claim. In the revised manuscript, we will expand the abstract to explicitly name the cybersecurity benchmarks used, report key numerical scores with comparisons to same-scale baselines, and briefly reference the evaluation methodology. The full paper already contains detailed tables and results sections with these elements; we will ensure they are cross-referenced in the abstract for reproducibility. revision: yes

  2. Referee: The multi-dimensional evaluation system and data synthesis pipelines are described as custom and self-developed, yet no independent external benchmarks or ablation studies are referenced to demonstrate that reported gains are not reducible to metrics defined or influenced by the training process itself. This creates a circularity risk for the performance claims that requires explicit external validation.

    Authors: We acknowledge the risk of circularity when relying primarily on custom pipelines and evaluation frameworks. To mitigate this, the revised manuscript will include explicit references to independent, widely adopted external benchmarks (both cybersecurity-specific and general-purpose) alongside our multi-dimensional system. We will also add ablation studies that isolate the effects of the data synthesis pipelines and each training stage (CPT, SFT, RL), demonstrating that gains are not artifacts of self-defined metrics. These additions will be placed in the experiments and evaluation sections. revision: yes
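
The promised stage-wise ablation reduces to reporting the marginal gain of each cumulative training stage on a fixed benchmark. A toy computation with invented accuracies shows the shape of such a table:

```python
# Invented benchmark accuracies after each cumulative training stage.
stages = {"base": 0.41, "+CPT": 0.52, "+SFT": 0.61, "+RL": 0.66}

names = list(stages)
for prev, cur in zip(names, names[1:]):
    delta = stages[cur] - stages[prev]  # marginal contribution of this stage
    print(f"{cur}: {stages[cur]:.2f} ({delta:+.2f} vs {prev})")
```

If the deltas survive on independent external benchmarks, the gains are attributable to the stages rather than to self-defined metrics.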

Circularity Check

0 steps flagged

No significant circularity in claimed results

Full rationale

The paper presents an empirical technical report on training XekRung via a standard CPT+SFT+RL pipeline on custom synthetic cybersecurity data, followed by reporting performance on benchmarks using a multi-dimensional evaluation system. No equations, derivations, or first-principles predictions are described that reduce to inputs by construction. The SOTA claim is framed as an observed outcome of the pipeline on external benchmarks rather than a fitted or self-defined quantity, and no load-bearing self-citations or ansatzes are invoked to justify core results. The derivation chain is therefore self-contained as an engineering description without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger captures high-level assumptions stated or implied; many concrete details such as model scale, data volumes, specific benchmarks, and training hyperparameters remain unknown and unaccounted for.

axioms (2)
  • domain assumption: Domain-tailored data synthesis pipelines can generate high-quality training data that improves LLM performance on cybersecurity tasks.
    Invoked as the foundation for scalable high-quality data construction in the abstract.
  • domain assumption: A complete pipeline of continued pre-training, supervised fine-tuning, and reinforcement learning extends both domain-specific and general capabilities.
    Assumed in the description of the training process used to build the model.

pith-pipeline@v0.9.0 · 5469 in / 1506 out tokens · 35245 ms · 2026-05-09T20:51:51.075201+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

179 extracted references · 57 canonical work pages · 18 internal anchors
