pith. machine review for the scientific record.

arxiv: 2604.21108 · v2 · submitted 2026-04-22 · 💻 cs.CL · cs.LG

Recognition: unknown

Machine learning and emoji prediction: How much accuracy can MARBERT achieve?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:03 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Arabic tweets · emoji prediction · MARBERT · machine learning · accuracy · multidialectal Arabic · transformer models · social media

The pith

MARBERT reaches 0.75 accuracy predicting emojis from Arabic tweets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether the MARBERT model can predict the emoji attached to an Arabic tweet. The authors collected 11,379 colloquial tweets from X.com across multiple dialects, preprocessed them down to 8,695 examples, and labeled each with one of 14 emoji categories. They fine-tuned MARBERT on the text to output the emoji label and measured performance with precision, recall, F1, and overall accuracy. The model reached 0.75 accuracy, which the authors treat as a positive result for a low-resource multidialectal language while noting that further gains are possible. A reader would care because emoji use is central to informal online Arabic communication, and better predictors could support improved sentiment tools or content analysis.

Core claim

The paper claims that fine-tuning the MARBERT transformer on a dataset of 8,695 preprocessed Arabic tweets, each assigned to one of 14 emoji categories, yields an overall accuracy of 0.75 for emoji prediction from text. Performance is reported via precision, recall, and F1-scores, and the authors conclude that the outcome is promising yet indicates a continuing need to strengthen machine learning models for low-resource, multidialectal languages such as Arabic.

What carries the argument

Fine-tuned MARBERT, a bidirectional transformer pretrained on Arabic, applied to tweet text to predict one of 14 emoji labels after a preprocessing pipeline that extracts and numerically encodes lexical features.
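The label side of that pipeline — encoding 14 emoji categories as integers and holding out a test set stratified by class — can be sketched in a few lines. The category names, split ratio, and helper below are illustrative assumptions, not details reported in the paper:

```python
import random
from collections import defaultdict

# Hypothetical category inventory; the paper's actual 14 classes are not listed.
CATEGORIES = ["joy", "sadness", "love", "anger"]
LABEL2ID = {name: i for i, name in enumerate(CATEGORIES)}

def stratified_split(examples, test_frac=0.2, seed=0):
    """Split (text, label) pairs so every class keeps the same train/test ratio.

    This is the kind of stratification the referee report asks the authors
    to document for the 0.75 accuracy figure.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        k = max(1, int(len(items) * test_frac))  # at least one test example per class
        test.extend(items[:k])
        train.extend(items[k:])
    return train, test

# Toy usage: 40 synthetic tweets, 10 per class, numerically encoded labels.
data = [(f"tweet {i}", LABEL2ID[CATEGORIES[i % len(CATEGORIES)]]) for i in range(40)]
train, test = stratified_split(data)
```

With a 0.2 test fraction, every class contributes two of its ten examples to the held-out set, so no emoji category is absent from evaluation.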

Load-bearing premise

The 8695 tweets after preprocessing represent real-world Arabic emoji usage across dialects without systematic bias from the 14-category labeling scheme.

What would settle it

Testing the fine-tuned MARBERT on a new, larger, independently gathered set of Arabic tweets from varied regions and obtaining accuracy well below 0.75 would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2604.21108 by Ibrahim Abdulmalik Hassan, Muneef Y. Alshawsh, Mohammed Q. Shormani.

Figure 1: Part of Python training script
Figure 2: MARBERT workflow
Figure 3: MARBERT training loss vs. validation loss
Original abstract

This study investigates Machine Learning (ML) in the prediction of emojis in Arabic tweets employing the (state-of-the-art) MARBERT model. A corpus of 11379 CA tweets representing multiple Arabic colloquial dialects was collected from X.com via Python. A net dataset includes 8695 tweets, which were utilized for the analysis. These tweets were then classified into 14 categories, which were numerically encoded and used as labels. A preprocessing pipeline was designed as an interpretable baseline, allowing us to examine the relationship between lexical features and emoji categories. MARBERT was finetuned to predict emoji use from textual input. We evaluated the model performance in terms of precision, recall and F1-scores. Findings reveal that the model performed quite well with an overall accuracy 0.75. The study concludes that although the findings are promising, there is still a need for improving machine learning models including MARBERT, specifically for low-resource and multidialectal languages like Arabic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper collects 11,379 Arabic tweets from X.com, preprocesses them to a final set of 8,695 tweets, assigns them to 14 emoji categories, and fine-tunes the MARBERT model for emoji prediction. It reports an overall accuracy of 0.75 along with precision, recall, and F1 scores, and concludes that the results are promising but that further improvements are needed for multidialectal, low-resource languages such as Arabic.

Significance. If the performance numbers are reproducible under standard evaluation protocols, the work supplies a concrete data point on the effectiveness of a state-of-the-art Arabic BERT variant for a downstream classification task. The multidialectal tweet collection is a modest positive contribution, but the study is a routine fine-tuning exercise rather than a methodological advance.

major comments (4)
  1. [Abstract] Abstract: the headline claim of 0.75 accuracy is presented without any description of the train/test split ratio, whether the split was stratified by dialect or emoji class, or whether hyper-parameter search and model selection were performed on the test set. These omissions make the central empirical result impossible to interpret or replicate.
  2. [Abstract] Abstract / Results: no quantitative baseline (e.g., majority-class, TF-IDF + logistic regression, or the preprocessing pipeline alone) is reported, so it is impossible to judge whether the 0.75 accuracy represents an improvement over simpler methods.
  3. [Data and preprocessing] Data collection and preprocessing: the paper states that tweets were classified into 14 categories but supplies no information on how the category inventory was chosen, whether the labeling was done by multiple annotators, or how the test set was isolated from any preprocessing decisions that could introduce leakage.
  4. [Evaluation] Evaluation: precision, recall, and F1 are mentioned but no per-class scores, confusion matrix, or error bars from multiple runs or cross-validation folds are provided, leaving the stability of the 0.75 figure unverified.
minor comments (2)
  1. [Abstract] Abstract: the phrasing 'a net dataset includes 8695 tweets' is unclear; state explicitly what filtering steps produced this number from the initial 11,379 tweets.
  2. [Conclusion] The conclusion that 'there is still a need for improving machine learning models' is too generic; specify which aspects (e.g., dialect handling, emoji ambiguity) require further work.
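Major comment 2 is easy to make concrete: with imbalanced classes, a trivial predictor can score deceptively well, so 0.75 is only meaningful relative to such a floor. A minimal majority-class baseline, with an invented class distribution for illustration (the paper does not report its label frequencies):

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label.

    Any learned model should beat this floor before its accuracy is
    treated as evidence of learning.
    """
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

# Invented distribution: if one emoji class covered 60% of tweets,
# the trivial baseline would already reach 0.60.
train = [0] * 60 + [1] * 25 + [2] * 15
test = [0] * 12 + [1] * 5 + [2] * 3
acc = majority_class_accuracy(train, test)  # 12/20 = 0.6
```

A TF-IDF + logistic regression model, as the referee suggests, would sit between this floor and the fine-tuned transformer.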

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that several aspects of the experimental description require clarification to improve reproducibility and allow proper assessment of the results. Below we respond point by point to the major comments and indicate the changes we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of 0.75 accuracy is presented without any description of the train/test split ratio, whether the split was stratified by dialect or emoji class, or whether hyper-parameter search and model selection were performed on the test set. These omissions make the central empirical result impossible to interpret or replicate.

    Authors: We agree that the abstract omits key methodological details. The full manuscript contains the experimental protocol, but we will revise the abstract to state the train/test split ratio, indicate whether stratification by emoji class or dialect was applied, and clarify that hyper-parameter tuning and model selection were performed on a validation set held out from the training data only. We will also add a dedicated 'Experimental Setup' subsection in the methods section. revision: yes

  2. Referee: [Abstract] Abstract / Results: no quantitative baseline (e.g., majority-class, TF-IDF + logistic regression, or the preprocessing pipeline alone) is reported, so it is impossible to judge whether the 0.75 accuracy represents an improvement over simpler methods.

    Authors: The manuscript introduces the preprocessing pipeline as an interpretable baseline, yet we acknowledge that no quantitative comparison against it or against standard baselines is reported. We will add results for a majority-class baseline and a TF-IDF + logistic regression model in the results section so that the improvement achieved by MARBERT can be directly evaluated. revision: yes

  3. Referee: [Data and preprocessing] Data collection and preprocessing: the paper states that tweets were classified into 14 categories but supplies no information on how the category inventory was chosen, whether the labeling was done by multiple annotators, or how the test set was isolated from any preprocessing decisions that could introduce leakage.

    Authors: We will expand the data and preprocessing section to describe the rationale for selecting the 14 emoji categories, detail the labeling procedure, and explicitly state that the test set was isolated prior to any preprocessing steps. We will also note the limitation that labeling was performed by the authors rather than multiple independent annotators. revision: yes

  4. Referee: [Evaluation] Evaluation: precision, recall, and F1 are mentioned but no per-class scores, confusion matrix, or error bars from multiple runs or cross-validation folds are provided, leaving the stability of the 0.75 figure unverified.

    Authors: We agree that per-class metrics and a confusion matrix are necessary for a complete evaluation. We will add both to the results section. Because the reported experiments consist of a single training run, we will either conduct additional runs with varied random seeds to report variance or explicitly discuss this as a limitation; the revision will make the choice clear. revision: partial
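The per-class scores and confusion matrix promised in response 4 follow directly from the prediction vectors; a self-contained sketch with made-up labels (not the paper's outputs), shown for 3 of the 14 classes:

```python
def per_class_report(y_true, y_pred, n_classes):
    """Confusion matrix plus per-class precision/recall/F1.

    cm[i][j] counts examples of true class i predicted as class j,
    so the diagonal holds the true positives.
    """
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    report = {}
    for c in range(n_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n_classes)) - tp  # column minus diagonal
        fn = sum(cm[c]) - tp                               # row minus diagonal
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[c] = {"precision": prec, "recall": rec, "f1": f1}
    return cm, report

# Toy labels: class 1 is always recovered, classes 0 and 2 are each missed once.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm, report = per_class_report(y_true, y_pred, 3)
```

Publishing exactly this breakdown would show whether the 0.75 overall accuracy is uniform across emoji categories or carried by a few frequent ones.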

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical report on fine-tuning MARBERT for emoji classification on a collected Arabic tweet corpus. It describes data collection, preprocessing, labeling into 14 categories, model fine-tuning, and reports standard metrics (accuracy 0.75, precision/recall/F1). No derivation chain, equations, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations appear. The central result is a direct experimental outcome on held-out evaluation, not equivalent to its inputs by construction. The work is self-contained as a routine ML application without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the domain assumption that MARBERT's pretraining transfers usefully to emoji prediction; no new entities are postulated, and the only free parameters are the usual fine-tuning hyperparameters, whose values are not reported.

axioms (2)
  • domain assumption MARBERT embeddings capture sufficient lexical and dialectal information for emoji prediction in Arabic
    Invoked when the authors fine-tune MARBERT directly on the tweet-emoji pairs without additional feature engineering beyond the preprocessing baseline.
  • ad hoc to paper The 14 emoji categories are mutually exclusive and exhaustive for the collected tweets
    The numerical encoding of categories assumes clean, non-overlapping labels that do not require multi-label handling.

pith-pipeline@v0.9.0 · 5474 in / 1482 out tokens · 114479 ms · 2026-05-10T00:03:06.226240+00:00 · methodology

discussion (0)

