
arxiv: 2606.30790 · v1 · pith:ICO6BHOMnew · submitted 2026-06-29 · 💻 cs.CL · cs.AI

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

Pith reviewed 2026-07-01 02:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Romanized code-mixing · Indic languages · LLM evaluation · code-mixed instructions · benchmark · instruction following · performance degradation · multilingual models

The pith

Large language models underperform on Romanized Indic-English code-mixed instructions, with larger drops at higher mixing densities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Indi-RomCoM as a benchmark spanning seven instruction-following tasks, four Indic languages, and three controlled code-mixing intensity levels to test LLMs on Romanized code-mixed content. It shows consistent underperformance that worsens as code-mixing density increases. Reasoning tasks degrade less than detection tasks such as toxicity identification because generated explanations supply extra context. This evaluation covers proprietary, open-weight, and Indic-focused models in zero- and few-shot settings. The work targets the gap between strong monolingual performance and the realities of fluid bilingual communication in Roman script.

Core claim

Indi-RomCoM reveals that LLMs underperform on Romanized Indic-English code-mixed instructions, with performance falling as code-mixing density rises. Reasoning tasks experience milder degradation than detection tasks because the explanations they generate supply necessary context for correct answers.
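
The degradation pattern in this claim can be made concrete with a small sketch. It assumes per-example results are available as (task type, code-mixing level, correct) rows; the record layout, the function names, and the toy values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical per-example results: (task_type, cm_level_percent, is_correct).
# Placeholder values for illustration only, not numbers from the paper.
results = [
    ("reasoning", 25, True), ("reasoning", 25, True),
    ("reasoning", 75, True), ("reasoning", 75, False),
    ("detection", 25, True), ("detection", 25, True),
    ("detection", 75, False), ("detection", 75, False),
]

def accuracy_by_level(records):
    """Return {task_type: {cm_level: accuracy}} from (task, level, correct) rows."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for task, level, correct in records:
        cell = counts[task][level]
        cell[0] += int(correct)
        cell[1] += 1
    return {
        task: {level: c / n for level, (c, n) in levels.items()}
        for task, levels in counts.items()
    }

def degradation(acc, low=25, high=75):
    """Accuracy drop (0-1 scale) from the low to the high code-mixing level."""
    return {task: levels[low] - levels[high] for task, levels in acc.items()}

acc = accuracy_by_level(results)
drops = degradation(acc)
```

On real results, the claim would predict a positive drop for every task type, with a smaller drop for reasoning than for detection.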

What carries the argument

The Indi-RomCoM benchmark, a controlled test suite of seven instruction-following tasks across four Indic languages at three code-mixing intensity levels.

Load-bearing premise

The seven selected tasks and three controlled mixing levels accurately reflect the real-world difficulties LLMs encounter with Romanized Indic-English code-mixing.

What would settle it

A new model achieving near-monolingual accuracy on high-density RCM toxicity detection with no measurable drop relative to low-density cases.

Figures

Figures reproduced from arXiv: 2606.30790 by Avisha Das, Mihir Parmar, Mohana Ramnath, Pulkit Verma.

Figure 1: Example (LLaMa-7B-Instruct) showing failure to understand romanized “tanglish”.
Figure 2: Overview of the Indi-RomCoM creation framework. The pipeline processes English-only and Indic-native tasks through initial translation (Phase I), applies a taxonomy-guided generation engine to create controlled code-mixing at 25%, 50%, and 75% intensities (Phases II & III), and concludes with script romanization to produce the final multi-level benchmark (Phase IV). CM-ed prompts collected through crowdsou…
Figure 3: Word-category substitution frequency at 50%-CM (top) and 75%-CM (bottom). At 25%-CM all substitutions are CAT-C (100%) by construction and are not shown. Colours: CAT-C (blue), CAT-B (orange), CAT-D (green), CAT-A (red), CAT-E (purple). Labels for segments >7%.
Figure 4: Average task accuracy (%) under code-mixed (CM) instructions by model and task. Each bar reports accuracy averaged across three CM intensity levels (25%, 50%, 75%) and four Indic languages in the zero-shot setting. Error bars denote standard deviation. Dashed vertical lines separate the seven task groups.
Figure 5: Per-language CM accuracy across 7 tasks and 3 LLM families. Each spider plot shows average accuracy (%) across 25%-, 50%-, and 75%-CM conditions for Proprietary (blue), Open-weight (green), and Indic-focused (red) model families. Performance decreases from Hindi to Gujarati, reflecting differences in LLM pretraining corpus coverage across languages.
Figure 6: CMI distributions across tasks at each CM intensity level (violin plots). Dashed lines at …
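
The Figure 6 caption reports CMI distributions without defining the metric in the text reproduced here. As a hedged illustration, the sketch below assumes CMI means the Code-Mixing Index of Das and Gambäck (2014), computed per utterance from word-level language tags. The tag names, the `neutral_tag` parameter, and the example utterance are hypothetical, and the paper may use a different variant.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral_tag="other"):
    """Code-Mixing Index for one utterance, on a 0-100 scale.

    lang_tags holds one language label per token. Tokens labelled with
    neutral_tag (punctuation, numbers, named entities) are treated as
    language-independent and excluded from the ratio.
    """
    lang_counts = Counter(tag for tag in lang_tags if tag != neutral_tag)
    total_lang = sum(lang_counts.values())  # n - u in the usual notation
    if total_lang == 0:
        return 0.0  # no language-bearing tokens at all
    dominant = max(lang_counts.values())  # tokens in the most frequent language
    # Equivalent to 100 * (1 - dominant / (n - u)).
    return 100.0 * (total_lang - dominant) / total_lang

# Hypothetical tagged utterance: three Hindi tokens, two English, one neutral.
example_tags = ["hi", "hi", "en", "en", "hi", "other"]
example_cmi = code_mixing_index(example_tags)  # 100 * 2 / 5 = 40.0
```

A monolingual utterance scores 0 under this definition, and the score rises as the non-dominant language takes a larger share of the tokens, which is why a violin plot of CMI per intensity level is a natural check that the 25%, 50%, and 75% conditions are actually separated.
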
Original abstract

Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs) perform strongly on monolingual and native-script benchmarks, their ability to follow instructions and reason over RCM-based content remains largely unexplored. To this end, we introduce the Indi-RomCoM benchmark for facilitating systematic evaluation on Indic Romanized Code-Mixed instructions. Our benchmark spans seven instruction-following tasks, four widely spoken Indic languages, and three controlled code-mixing intensity levels. We extensively evaluate a suite of LLMs covering proprietary, open-weight, and Indic-focused models under zero- and few-shot settings. LLMs consistently underperform on RCM instructions, with performance degrading as code-mixing density increases. Furthermore, reasoning tasks suffer less degradation than detection tasks (e.g., Toxicity) because the generated explanations offer necessary context. We believe Indi-RomCoM helps the community in developing inclusive multilingual systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Indi-RomCoM, a benchmark for evaluating LLMs on Romanized Indic-English code-mixed instructions. It spans seven instruction-following tasks, four Indic languages, and three controlled code-mixing intensity levels. Evaluations of proprietary, open-weight, and Indic-focused LLMs in zero- and few-shot settings show consistent underperformance on RCM instructions that worsens with higher code-mixing density. Reasoning tasks degrade less than detection tasks (e.g., Toxicity) because generated explanations supply necessary context.

Significance. If the benchmark construction and evaluations are sound, this work identifies a practically relevant gap in LLM handling of prevalent real-world communication patterns in Indic communities. The controlled mixing levels enable systematic study of degradation trends, and the reasoning-vs-detection distinction provides a concrete observation that could inform targeted improvements in multilingual model training.

major comments (2)
  1. [Abstract] Abstract: The abstract states performance degradation results but provides no details on evaluation methodology, dataset construction, statistical tests, or error analysis, preventing verification of whether the claims are supported by the data.
  2. [Benchmark Construction] Benchmark design: The claim that the seven chosen instruction-following tasks and three controlled code-mixing intensity levels accurately capture real-world challenges requires explicit justification or comparison against natural code-mixing distributions; without this, the generalizability of the degradation findings is difficult to assess.
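
One form the statistical testing requested in comment 1 could take is a paired significance test on the same items rendered at two code-mixing levels. The sketch below uses an exact two-sided McNemar test as an assumed choice; the text reproduced here does not say which tests the paper applies, and the function name and toy data are hypothetical.

```python
from math import comb

def mcnemar_exact(correct_low, correct_high):
    """Exact two-sided McNemar test on paired boolean outcomes.

    correct_low[i] and correct_high[i] say whether the model answered item i
    correctly at the low and high code-mixing levels. Returns (b, c, p_value),
    where b counts items correct only at the low level and c counts items
    correct only at the high level.
    """
    pairs = list(zip(correct_low, correct_high))
    b = sum(1 for low, high in pairs if low and not high)
    c = sum(1 for low, high in pairs if high and not low)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical outcomes for ten paired items (placeholder data).
low_density = [True] * 10
high_density = [True] * 4 + [False] * 6
b, c, p_value = mcnemar_exact(low_density, high_density)
```

Only discordant pairs enter the test, so it asks directly whether items flip from correct to incorrect more often than the reverse as mixing density rises.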

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

  1. Referee: [Abstract] Abstract: The abstract states performance degradation results but provides no details on evaluation methodology, dataset construction, statistical tests, or error analysis, preventing verification of whether the claims are supported by the data.

    Authors: We acknowledge that the abstract prioritizes brevity and omits methodological specifics. The full paper details the zero- and few-shot evaluation protocol, model suite, dataset construction, and result analysis (including degradation trends) in Sections 3–5. We will revise the abstract to concisely reference the evaluation settings, controlled mixing levels, and the presence of supporting analysis in the main text.
    Revision: yes

  2. Referee: [Benchmark Construction] Benchmark design: The claim that the seven chosen instruction-following tasks and three controlled code-mixing intensity levels accurately capture real-world challenges requires explicit justification or comparison against natural code-mixing distributions; without this, the generalizability of the degradation findings is difficult to assess.

    Authors: Task selection draws from standard instruction-following categories (classification, generation, reasoning) commonly used in multilingual NLP benchmarks to reflect practical applications. Mixing levels follow established linguistic metrics for code-mixing density to enable controlled isolation of effects. We agree that explicit justification and references to natural distributions would strengthen the paper. We will add a dedicated paragraph in the benchmark construction section with citations to prior work on Indic code-mixing and rationale for the synthetic control approach.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark study


The paper creates the Indi-RomCoM benchmark across seven tasks, four languages, and three code-mixing levels, then evaluates LLMs under zero- and few-shot settings. No equations, derivations, parameter fitting, or self-referential claims appear in the provided text. The central claims (underperformance on RCM, degradation with density, reasoning vs. detection differences) are direct empirical observations, not reductions to inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. This matches the default case of a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark design rests on domain assumptions about task selection and mixing level representativeness without external validation mentioned.

axioms (1)
  • Domain assumption: The seven instruction-following tasks and three controlled code-mixing intensity levels are appropriate and representative for evaluating LLM performance on RCM.
    Invoked when describing the benchmark construction and evaluation scope.

pith-pipeline@v0.9.1-grok · 5718 in / 1086 out tokens · 65320 ms · 2026-07-01T02:23:22.454055+00:00 · methodology

