Lessons from the Trenches on Reproducible Evaluation of Language Models
Pith reviewed 2026-05-16 18:41 UTC · model grok-4.3
The pith
The Language Model Evaluation Harness provides standardized tools and practices to make evaluations of language models reproducible and comparable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Effective evaluation of language models remains an open challenge in NLP due to methodological issues such as sensitivity to evaluation setup, difficulty of proper comparisons across methods, and lack of reproducibility and transparency. Drawing on three years of experience evaluating large language models, the authors survey common challenges, delineate best practices for mitigating them, and present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
What carries the argument
The Language Model Evaluation Harness (lm-eval), an open source library that implements standardized evaluation tasks and protocols to support consistent and extensible testing of language models.
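To make the idea of "standardized tasks and protocols" concrete, here is a minimal sketch of a task registry in the spirit of lm-eval's task abstraction. All names and data here are hypothetical illustrations, not the library's actual API: the point is that once a task's examples, prompt template, and metric are pinned down under a stable name, any model can be scored against exactly the same inputs.

```python
# Illustrative sketch of a standardized evaluation interface. Every name
# here is made up for illustration; it is not lm-eval's real API.

TASKS = {}

def register_task(name):
    """Register an evaluation task under a stable name so any model is
    scored against the exact same prompts and metric."""
    def wrap(cls):
        TASKS[name] = cls
        return cls
    return wrap

@register_task("toy_copa")
class ToyCopa:
    # Fixed examples and a fixed prompt template: the setup is specified
    # once, so every evaluator reproduces the same inputs.
    examples = [
        {"premise": "The ice melted.",
         "choices": ["It was heated.", "It was frozen."], "label": 0},
        {"premise": "The plant wilted.",
         "choices": ["It was watered.", "It was not watered."], "label": 1},
    ]
    template = "{premise} What was the cause? {choice}"

    def evaluate(self, score_fn):
        """score_fn(text) -> float (e.g. a model log-likelihood).
        Returns accuracy over the fixed examples."""
        correct = 0
        for ex in self.examples:
            scores = [score_fn(self.template.format(premise=ex["premise"], choice=c))
                      for c in ex["choices"]]
            pred = max(range(len(scores)), key=scores.__getitem__)
            correct += int(pred == ex["label"])
        return correct / len(self.examples)

# A trivial stand-in "model" that scores shorter prompts higher.
acc = TASKS["toy_copa"]().evaluate(lambda text: -len(text))
```

The real library exposes a command-line entry point and registered task configurations that play this role; the sketch only shows why a shared registry makes results comparable across groups.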
If this is right
- Researchers gain the ability to run evaluations independently without depending on original authors' code or setups.
- Comparisons between different language models and methods become more reliable due to reduced sensitivity to implementation details.
- Transparency improves as the library makes evaluation code and tasks publicly available and modifiable.
- New evaluation tasks can be added in a way that maintains compatibility with existing ones.
- Case studies demonstrate the library's use in addressing real methodological concerns in published research.
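The second bullet's claim about sensitivity to implementation details can be made concrete with a small sketch. The model outputs and gold answers below are invented for illustration; the point is that a seemingly minor choice, such as how answers are normalized before matching, can swing a reported score dramatically:

```python
# Invented outputs and gold answers, purely for illustration of how an
# implementation detail (answer normalization) shifts a measured score.

gold = ["Paris", "4", "yes"]
model_outputs = ["paris", " 4.", "Yes, it is."]

def exact_match(pred, target):
    # Strict: the raw string must match the gold answer exactly.
    return pred == target

def normalized_match(pred, target):
    # Lenient: lowercase, strip whitespace and punctuation, keep first word.
    clean = pred.strip().lower().rstrip(".,!").split()[0].rstrip(".,!")
    return clean == target.lower()

strict = sum(exact_match(p, g) for p, g in zip(model_outputs, gold)) / len(gold)
lenient = sum(normalized_match(p, g) for p, g in zip(model_outputs, gold)) / len(gold)
# The same outputs score 0.0 under strict matching and 1.0 under the
# lenient normalizer: two papers using different matchers would report
# very different numbers for the same model.
```

A shared harness removes this ambiguity by fixing the matcher per task rather than leaving it to each re-implementation.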
Where Pith is reading between the lines
- This standardization could reduce wasted effort spent on re-implementing evaluations across different research groups.
- Adoption might shift focus from evaluation engineering toward actual model innovations in natural language processing.
- Similar libraries could be developed for other machine learning domains facing reproducibility issues.
- Long-term use might allow better tracking of progress by enabling direct comparisons over time.
Load-bearing premise
The primary barriers to reproducible evaluation are inconsistent setups and lack of shared tools, and introducing a common library will reduce these issues without creating new methodological problems of its own.
What would settle it
Running the same set of models through the library in multiple independent environments and observing significant unexplained differences in results would challenge whether the library truly achieves reproducibility.
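The reproducibility check described above can be sketched in a few lines: collect per-task scores from two independent environments and flag gaps larger than a tolerance. The result dictionaries and task names here are invented for illustration:

```python
# Hypothetical score tables from two independent runs of the same model;
# keys and values are invented for illustration.
env_a = {"model-x/hellaswag": 0.714, "model-x/arc": 0.552}
env_b = {"model-x/hellaswag": 0.714, "model-x/arc": 0.431}

def diverging_results(a, b, tol=0.005):
    """Return (key, score_a, score_b) triples whose gap exceeds tol,
    over the tasks present in both runs."""
    return [(k, a[k], b[k]) for k in sorted(a.keys() & b.keys())
            if abs(a[k] - b[k]) > tol]

mismatches = diverging_results(env_a, env_b)
# A non-empty list of unexplained discrepancies is exactly the outcome
# that would challenge the library's reproducibility claim.
```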
read the original abstract
Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. Third, we present the Language Model Evaluation Harness (lm-eval): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper draws on three years of experience evaluating large language models to outline common methodological challenges (sensitivity to setup, comparison difficulties, reproducibility gaps), delineate best practices for mitigation, and introduce the open-source lm-eval library with its features and case studies to support independent, reproducible, and extensible evaluations.
Significance. If the library's design and documented practices hold, the work provides a practical, community-oriented contribution that can materially improve comparability and transparency in NLP research by reducing common evaluation pitfalls through reusable tooling rather than ad-hoc scripts.
Minor comments (2)
- [Library features and case studies] The description of library features would benefit from explicit cross-references to the case studies (e.g., which feature directly resolved a reproducibility issue in a given study).
- [Conclusion] A brief note on maintenance and versioning strategy for the open-source release would strengthen the reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately reflects the paper's focus on practical lessons from LM evaluation experience and the role of the lm-eval library in addressing reproducibility challenges.
Circularity Check
No significant circularity; library presented as independent engineering artifact
full rationale
The paper draws on external experience to enumerate known methodological sensitivities in LM evaluation, offers concrete best practices, and releases lm-eval as an open-source implementation. No equations, fitted parameters, or predictions appear; no self-citation chain is invoked to justify a uniqueness theorem or force a result. The central claim reduces to documentation of observed problems plus a reusable tool whose value is shown by usage, not by internal re-derivation of its own inputs. This is the normal non-circular case for a best-practices and tooling paper.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tagged: unclear). Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Even minor variations in prompts, formatting, or other implementation details can significantly impact the performance and validity of evaluations."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged: unclear). Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- LAB-Bench: Measuring Capabilities of Language Models for Biology Research. LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial result...
- Visual Text Compression as Measure Transport. Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...
- HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing. HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
- Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild. Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- Refusal in Language Models Is Mediated by a Single Direction. Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging. DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.
- Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models. SFT on procedural skills yields uniform gains of 4-7.5 percentage points across 0.8B-4B Qwen models, driven by a W-shaped pre-SFT base trajectory where SFT compensates most for initial weaknesses.
- SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask. SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
- Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks. Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
- TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering. TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid...
- Kimi Linear: An Expressive, Efficient Attention Architecture. Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
- Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models. SFT delivers uniform procedural skill gains of 4-7.5 points across 0.8B-4B models while pre-SFT performance follows a W-shape, making SFT most effective where base models struggle.
- Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain? Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.
- Kimi K2: Open Agentic Intelligence. Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- Submodular Benchmark Selection. Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information better than entropy at small sizes.
- Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation. Closure of the Perspective API exposes structural dependence on a single proprietary toxicity scorer, leaving non-updatable benchmarks and irreproducible results while risking continued reliance on closed LLMs.
- Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models. Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.