PREPROCESSING ARABIC DIALECT FOR SENTIMENT MINING: STATE OF ART

Sentiment Analysis concerns the analysis of ideas, emotions, evaluations, values, attitudes and feelings about products, services, companies, individuals, tasks, events, titles and their characteristics. With the growth of Internet applications and social networks, Sentiment Analysis has become increasingly important in text mining research and is used to explore users' opinions on the products and topics discussed online. Developments in Natural Language Processing and Computational Linguistics have contributed positively to Sentiment Analysis studies, especially for sentiments written in unstructured or semi-structured languages. In this paper, we present a literature review of the pre-processing task in sentiment analysis, together with an analytical and comparative study of research conducted on Arabic social networks. This study allowed us to conclude that several works have dealt with the generation of stop word dictionaries. In this context, two approaches are adopted: a manual one, which yields a limited list, and an automatic one, in which the list of stop words is extracted from social networks based on defined rules. For stemming, two algorithms have been proposed to isolate prefixes and suffixes from dialect words. However, few works have addressed dialects directly, without translation. The Moroccan dialect in particular is the fifth most studied Arabic dialect, after the Jordanian, Egyptian, Tunisian and Algerian dialects. Despite the significant lack of studies on Arabic dialects, this comparative study allowed us to draw several conclusions about the difficulties and challenges encountered, as well as the possible directions to explore in any pre-processing solution for dialect sentiment analysis.


INTRODUCTION
Nowadays, social networking has become one of the most popular communication tools. Social network environments are used by people of all ages, cultures and social categories to convey a variety of messages that can reach a global audience. Several Web platforms and social networks such as Facebook, Twitter… allow people to convey opinions, share experiences or simply talk about anything that concerns them online (Tan et al., 2011). The monitoring of social media has become an important way to analyse and detect trends, by studying and evaluating opinions on various topics such as politics (Eason et al., 1995), services (teaching, health...), marketing and business products.
People can share their opinions in an unconstrained environment, and companies can extract useful ideas for their decision-making process. To quantify what individuals think from qualitative textual data, a polarity classification task that detects positive, negative or neutral text is required. Although a large amount of research is available on the analysis of documents such as newspapers, articles and journals, several open questions remain concerning the real nature of the messages available on online social networks.
Many works are devoted to sentiment analysis of textual data in structured languages. However, much less effort has been committed to providing a precise sentiment classification for unstructured languages in general, and for the Moroccan dialect "Darija" in particular. In our previous work, we carried out a state of the art and a comparative study of the research done in recent years on sentiment analysis. Among other findings, we deduced that most studies translate the comments into a structured language such as English before analysing them, and that there are no standard resources for unstructured languages such as Moroccan Darija (MDL). Moreover, comments in Darija can be written in Arabic characters, Latin characters or a mixture of the two, which makes automatic processing more difficult to achieve.
In fact, user-generated content on the Web is generally unstructured and needs substantial pre-processing and analysis to extract useful knowledge (Melville et al., 2019). These steps depend on the nature of the language (structured or unstructured) and generally differ from one study to another. Their objective is to clean, normalize, transform and reduce the size of the data in order to adapt it to the learning algorithm.
The complete sentiment analysis process includes data collection, text pre-processing, sentiment detection and sentiment classification. The pre-processing step is the most important one, because messages on social networks are characterized by colloquial expressions, abbreviations, emoticons, lengthened words and irregular capitalization, and generally do not conform to canonical grammatical rules.
The objective of this work is to develop a comparative and statistical study of research on Arabic dialects for sentiment analysis. The paper is organized as follows: Section II develops the sentiment analysis background. Section III presents the related works. Section IV discusses our comparative study and Section V presents the statistical analysis. Finally, the conclusion and future works are detailed in Section VI.

Pre-processing task: major steps
The pre-processing is a very important phase in sentiment analysis. It allows for quality data to be obtained and ensures a better performance in this analysis. This phase can be carried out in several stages, which depend on the nature of the language and the analysis objectives. The main stages are:
• Data cleaning: one of the most important tasks in successfully mining social networks is cleaning up noisy data (Gharatkar et al., 2017).
• Stop words removal: activity for removing words that are used for structuring language but do not contribute in any way to its content. Some of these words are a, are, the, was…
• Tokenization: task for separating the full text string into a list of separate words. This is simple to perform in space-delimited languages such as English, Spanish or French, but becomes considerably more difficult in languages where words are not delimited by spaces like in Japanese, Chinese and Thai.
• Stemming: heuristic process for deleting word affixes and leaving them in an invariant canonical form or "stem". For instance, person, person's, personify and personification become person when stemmed. The most popular English stemmer algorithm is Porter's stemmer.
• Lemmatization: algorithmic process that brings a word into its non-inflected dictionary form. It is analogous to stemming but is achieved through a more rigorous set of steps that incorporate the morphological analysis of each word.
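As a rough illustration of the stages listed above, the following Python sketch chains tokenization, stop word removal and a toy suffix-stripping stemmer. The stop word set, the suffix list and the function names are assumptions made for this example. In particular, the stemmer is not Porter's algorithm, only a minimal stand-in that reproduces the "person" example.

```python
import re

# Illustrative assumptions: a tiny stop word set and suffix list, not standard resources.
STOP_WORDS = {"a", "are", "the", "was"}
SUFFIXES = ("ification", "ify", "'s")  # checked in this order, longest first


def tokenize(text):
    """Split space-delimited text into lowercase word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that only structure the sentence."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]
```

The tokenizer relies on spaces and letters of the Latin alphabet, so it only applies to space-delimited languages; a lemmatizer would replace `toy_stem` with a dictionary-backed morphological analysis.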
Despite the identification of these stages, the pre-processing phase faces several problems related to the sentiment analysis context. Words belonging to different parts of speech must be treated according to their linguistic role (adjectives, nouns, verbs, etc.). Word styling (bold, italic and underline) is not always available on online social media platforms and is often replaced by language conventions. The lengthening of words, as in "it's seeeeeerious" (commonly known as expressive elongation or word stretching), is an example of the new language conventions that are now very popular on online platforms. Other problems are related to additional terms: abbreviations, which are paralinguistic elements used in non-verbal communication (Lui et al., 2012) on online social networking platforms; hashtags, which are widely used on online social networks to express one or more specific feelings, and for which the distinction between sentiment hashtags and subject hashtags is a challenge that must be properly addressed for polarity classification; and emoticons, which are introduced as non-verbal expressive components in written language to reflect the role played by facial expressions in speech.
Another very important pre-processing challenge is detecting and analysing uppercase letters, given that users commonly write specific words in uppercase (for example, "#StarWars was UNBELIEVABLE!") to express the intensity of a positive or negative feeling.
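The cues discussed above (expressive elongation, hashtags and uppercase emphasis) can be extracted before the text is normalized, so that the intensity information is not lost. The sketch below is one possible way to do it; the function name, the returned dictionary and the rule "three or more repetitions count as elongation" are assumptions of this example.

```python
import re

ELONGATION = re.compile(r"(\w)\1{2,}")  # a word character repeated 3+ times
HASHTAG = re.compile(r"#\w+")


def expressive_features(text):
    """Return a normalized text plus the intensity cues found in the original."""
    hashtags = HASHTAG.findall(text)
    # Extract alphabetic words outside hashtags (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", HASHTAG.sub(" ", text))
    uppercase = [w for w in words if len(w) > 1 and w.isupper()]
    elongated = [w for w in words if ELONGATION.search(w)]
    # Collapse each elongated run to a single character.
    normalized = ELONGATION.sub(r"\1", text)
    return {
        "normalized": normalized,
        "hashtags": hashtags,
        "uppercase": uppercase,
        "elongated": elongated,
    }
```

Collapsing a run to a single character restores "serious" from "seeeeeerious", but it would damage words that legitimately contain a double letter; collapsing to two characters and checking against a lexicon is a common alternative.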

Arabic dialect pre-processing challenges
The pre-processing phase faces several problems and challenges. These challenges are greater for feelings written in unstructured languages. Sentiment analysis for Arabic dialects in general, and Moroccan Darija in particular, suffers from several complex problems related to the nature of these dialects, such as:
• Replacement of the kasra ("i" vowel/sound as in liberty) by the sukun (diacritic that marks the absence of a vowel) at the beginning of a word, as in كتاب ("ktAb") instead of كتاب ("kitAb") in Modern Standard Arabic (MSA);
• Bypassing or avoiding the Hamza, which is pronounced as a "ya" sound, e.g. "3A'ilah" (family) is uttered "3Aylah";
• Some pronouns are slightly modified from their MSA form;
• The negation is introduced by means of the word ما ("mA") and the suffix شْ, with the sukun on it ("sh"), as in ماكليتش. Negation also has some other expressions such as مارانيش ("mA rAnysh", "I am not") or ماراناش ("mA rAnAsh", "We are not")…
• A number of words that are not of Arabic origin have percolated into the Moroccan dialect, such as طابلة ("TAblah") for "table", from French, and كارطابلي ("Kartably") for "my schoolbag", from French...
• Most often, words of non-Arabic origin are inflected using the rules applied to those of Arabic origin. For instance, طابلة ("TAblah", "table") gets the plural form طبلات ("TAblAt", "tables"), following the regular plural of Arabic feminine nouns (which is the form used for all nouns of foreign origin) (Al ayyoub et al., 2019). Verbs of foreign origin are also conjugated as if they were of Arabic origin. For instance, مافرناتش ("ma franatch", "she did not pull the brake").
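The negation pattern described above (the word ما in front and the suffix ش at the end) lends itself to a simple surface rule. The following sketch is a naive heuristic written for illustration, not a method taken from the cited works: the function name and the minimum length threshold are assumptions, and the rule only covers the case where ما is attached to the word.

```python
import re

# Arabic short vowels, tanwin, shadda and sukun (U+064B to U+0652).
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def split_negation(token, min_inner_len=2):
    """Detect the ma...sh circumfix; return (is_negated, remaining_form)."""
    bare = DIACRITICS.sub("", token)
    if bare.startswith("ما") and bare.endswith("ش"):
        inner = bare[2:-1]  # drop the two letters of "ma" and the final "sh"
        if len(inner) >= min_inner_len:
            return True, inner
    return False, bare
```

Such a rule over-generates on words that happen to start with ما and end with ش, so in practice it would need to be combined with a lexicon or with the stemming rules discussed in the related work.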

RELATED WORK
Several studies have been interested in sentiment analysis and a variety of approaches have been developed, especially for English language. However, research studies are more limited when it comes to other languages, such as Arabic. This section discusses research studies in the field of sentiment analysis for Arabic dialects.
Stop word detection and elimination is one of the major challenges of sentiment analysis in the context of unstructured languages. In MSA and Arabic dialects, there is no general standard stop word list to use. To overcome this lack, (Walaa et al., 2015), for instance, generated a stop word list from Online Social Network (OSN) corpora such as Twitter, Facebook… for the Egyptian Dialect (ED). The methodology consists of three phases: calculating the words' frequency of occurrence, checking the validity of a word as a stop word, and adding all possible prefixes and suffixes to the words generated. (Alajmi et al., 2012) generated a stop word list for the Arabic language. The list generation involves several factors, such as word frequency calculation, mean and variance calculation, entropy calculation, and Borda's ranking. (Khouja et al., 1999) created an Arabic stemmer with a list of 168 stop words; this list has been used by (Larkey et al., 2001; Larkey et al., 2002). A stop list built by translating an English list and enriching it with high-frequency words from the corpus, leading to a larger 1,131-word list, was developed by (Chen and Gey, 2001). Moreover, a domain-dependent list, which has three problems, was created by (Savoy et al., 2002). Firstly, it contains words preceded by the letter waw "و", meaning "and": 17 such words, together with 11 duplicates. In Arabic, this letter can precede any word of the language without exception, so the most appropriate way to handle it is to strip it with an applicable and effective stemming algorithm. Secondly, it includes a large number of single letters with the waw, specifically ba' "ب", heh "ه", hamza "ء" and alef "أ, ا". Because of the nature of written Arabic, these letters can appear individually while still being part of a word, so deleting them can change the meaning of the word or make it meaningless. For example, the word "كتاب", which has several meanings such as writers, book, or a place of learning, includes the letter ba' as a single separate letter; removing it would therefore make the word meaningless. Thirdly, the list contains a few words that are not considered stop words, although they appear several times in the statistical analysis of the corpus, such as Casablanca "الدار البيضاء", United States "الولايات المتحدة", etc. Additionally, because the list is domain-dependent, it is unlikely to be appropriate for other collections.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIV-4/W3-2020, 2020 5th International Conference on Smart City Applications, 7-8 October 2020, Virtual Safranbolu, Turkey (online)
Another work categorized a group of Arabic hadiths from the so-called "Sahih Al-Bukhari", which is an 8-chapter book, by calculating term frequency. (Azmi et al., 2019) removed stop words using an algorithm based on the so-called "deterministic finite machine". The author recommended future research on the impact of pre-processing steps such as stemming and stop words generated from the words of hadiths. (Harrag et al., 2010) recommended deleting stop words together with high- and low-frequency words. (Jbara and Khitam, 2010) manually built a list of stop words that includes Arabic prepositions, pronouns, names of people such as Prophet Mohammad's companions, as well as places mentioned in the hadith corpus. Likewise, (Harrag et al., 2014) conducted several operations, including stop word elimination, to delete meaningless words such as definite articles and pronouns from the text of the hadith Matn.
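The automatic list generation described above combines several statistics per word. The sketch below illustrates the idea with only two of them, corpus frequency and entropy across documents, merged with a Borda count. It is a simplification written for this review, not the procedure of (Alajmi et al., 2012) or (Walaa et al., 2015); the function name, the `top_k` parameter and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def stop_word_candidates(docs, top_k=10):
    """Rank words by a Borda count over their frequency and entropy rankings."""
    total = Counter()
    per_doc = []
    for doc in docs:
        counts = Counter(doc.split())
        per_doc.append(counts)
        total.update(counts)

    # Entropy of each word's distribution over documents: high when the word
    # is spread evenly across the corpus, zero when it occurs in one document.
    entropy = {}
    for word, n in total.items():
        h = 0.0
        for counts in per_doc:
            c = counts.get(word, 0)
            if c:
                p = c / n
                h -= p * math.log(p)
        entropy[word] = h

    # Borda count: in each ascending ranking, a word earns its index as points.
    words = sorted(total)
    points = Counter()
    for score in (total, entropy):
        for pts, word in enumerate(sorted(words, key=lambda w: score[w])):
            points[word] += pts

    ranked = sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

Words that are both frequent and evenly spread rise to the top of the ranking; a validity check and the expansion with prefixes and suffixes, as in the three-phase methodology above, would follow as separate steps.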
Other very important challenges of sentiment analysis for unstructured languages arise in the normalization and stemming steps. Several research studies attempt to provide solutions to these problems. (Abuata et al., 2015) developed an algorithm to remove the suffixes and prefixes of dialect words and to extract the stems of the words used in Arabian Gulf countries (Kuwait, Bahrain, Qatar, UAE, the eastern area of Saudi Arabia, and the south of Iraq). (Albogamy et al., 2016) proposed a new stemmer for Arabic tweets that does not rely on any root dictionary. It works in two phases: phase 1 produces a list of all possible stems by using the grammar, and phase 2 selects the shortest stem as the correct one. They compared their stemmer with three Arabic stemmers, and the results showed that theirs has the best accuracy. (Shoukry et al., 2012) tested the effect of pre-processing (normalization, stemming, and stop words removal) on the performance of an Arabic sentiment analysis system using Arabic and Egyptian tweets from Twitter. Another study tested the effect of pre-processing on Saudi dialect sentiment analysis using RapidMiner. (Kanan et al., 2016) presented an algorithm containing a new set of rules for the analysis of Gulf dialects. It covers Kuwait, Bahrain, Qatar, the United Arab Emirates, parts of Saudi Arabia and some parts of southern Iraq. This new algorithm is able to handle all these Arabic dialects by defining new rules and merging them with those currently used. It is also able to treat all non-Arabic words used in Arabic dialects. (Mulki et al., 2018) tested the impact of pre-processing techniques on sentiment analysis using three Tunisian datasets of different sizes and multiple domains. The results emphasize the positive impact of the pre-processing phase, in particular of the stemming, emoji recognition and negation detection tasks.
(Alayba et al., 2018) studied the effect of text pre-processing on sentiment classification performance. Six pre-processing methods were applied to the text: removing URLs, removing numbers, removing stop words, normalising repeated letters, normalising acronyms to their original form, and normalising negative mentions. These methods were applied to five datasets and evaluated using four classifiers. The study indicated that removing numbers, stop words and URLs reduces the noise in the datasets, while normalising negative words and acronyms improves the classification performance. The authors applied sentiment classification using three classes only: "positive", "neutral" and "negative".
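To measure the effect of each pre-processing method separately, as in the studies above, it helps to implement every method as an independent step that can be switched on or off. The sketch below covers four of the six methods (URLs, numbers, repeated letters, stop words); the step names, the regular expressions and the three-word stop list are assumptions of this example, and acronym and negation normalization are left out because they require language-specific lexicons.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
NUMBER = re.compile(r"\d+")  # in str patterns, \d also matches Arabic-Indic digits
REPEAT = re.compile(r"(\w)\1{2,}")

# Hypothetical mini stop word list, for illustration only.
STOP_WORDS = {"في", "من", "على"}

STEPS = {
    "urls": lambda t: URL.sub(" ", t),
    "numbers": lambda t: NUMBER.sub(" ", t),
    "repeated_letters": lambda t: REPEAT.sub(r"\1", t),
    "stop_words": lambda t: " ".join(w for w in t.split() if w not in STOP_WORDS),
}


def apply_steps(text, enabled=("urls", "numbers", "repeated_letters", "stop_words")):
    """Apply the enabled steps in order, then squeeze whitespace."""
    for name in enabled:
        text = STEPS[name](text)
    return " ".join(text.split())
```

Calling `apply_steps` with different `enabled` tuples yields the variants of a dataset needed to compare classifiers with and without a given step.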

Criteria
In this section, we develop a comparative study of the most important works conducted on Arabic dialect sentiment analysis. The comparison criteria adopted are:
• Year of the paper.
• Language: before pre-processing and after pre-processing.
• Dataset: dataset size and source.
• Data cleaning: steps that deal with noisy data.
• Normalization: steps that generate consistent word forms (normalizing repeated letters and replacing slang, abbreviations, emoticons...).
• Stop words: steps conducted to remove words that are used for structuring text but do not contribute in any way to its content. Some of these words are: a, are, the, was…
• Stemming: heuristic process for deleting word affixes and leaving words in an invariant canonical form or "stem".
• Validation: a model validation parameter, in %.

Comparison
Several studies have treated Arabic dialects using different approaches and methodologies. The table below compares these works on several criteria in order to draw conclusions. Table 1 shows that most studies have carried out the pre-processing steps, each in its own way. The goal is to reduce noise and improve data quality for more accurate results.

STATISTICAL ANALYSIS AND DISCUSSION
30.8% of the studies chose to translate dialects into a structured language in order to benefit from the wealth of work carried out on structured languages and the ease of processing them.
The data cleaning step remains roughly the same in all works. It is used to delete non-significant data such as URLs (52.4%), punctuation (43.6%), hashtags (30.5%), special characters (51.3%)… Stop word elimination remains the most important step and requires a lot of effort when it concerns a dialect, unlike structured languages. Only 26.6% of the works were able to perform automatic stop word deletion for the dialect by building a rule-based dictionary. These rules are often based on the frequency of the word and its length.
After the noise is removed, the normalization step takes place. For Arabic dialects, the most frequently repeated step is the normalization of Hamza, Alef and Ya (47.6%). On the other hand, the replacement of emoticons was carried out by only 19.04% of the works, despite the subjective information that these emoticons contain and that can help in the analysis of feelings.
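A minimal sketch of these two normalization steps is given below. The letter mapping reflects one common convention (Alef variants to bare Alef, Alef maqsura to Ya, Hamza carriers to their base letter, Tatweel removed); the surveyed works do not all use the same mapping, and the emoticon lexicon and placeholder tokens are hypothetical.

```python
# Assumed mapping; code points are written as escapes to avoid look-alike letters.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alef with hamza above -> Alef
    "\u0625": "\u0627",  # Alef with hamza below -> Alef
    "\u0622": "\u0627",  # Alef with madda -> Alef
    "\u0649": "\u064A",  # Alef maqsura -> Ya
    "\u0624": "\u0648",  # Waw with hamza -> Waw
    "\u0626": "\u064A",  # Ya with hamza -> Ya
})
TATWEEL = "\u0640"

# Hypothetical emoticon lexicon: symbols become sentiment-bearing tokens.
EMOTICONS = {":)": " EMO_POS ", ":(": " EMO_NEG "}


def normalize(text):
    """Unify Hamza/Alef/Ya variants, drop Tatweel, replace emoticons."""
    text = text.replace(TATWEEL, "").translate(LETTER_MAP)
    for symbol, token in EMOTICONS.items():
        text = text.replace(symbol, token)
    return " ".join(text.split())
```

Replacing emoticons with tokens, rather than deleting them, keeps their subjective information available to the classifier.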
The multiple possible spellings of a single word in a dialect make normalization more complicated. Therefore, most works apply stemming to the structured part of the comments and ignore the unstructured part. Only 33.3% made an effort to develop rules for identifying the roots of words written in Arabic dialects, using techniques such as phonetic grouping and similarity grouping, as in the works on the dialects of some Gulf countries and Algeria.
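A dictionary-free, rule-based stemmer of the kind reviewed in the related work can be outlined in two phases: generate candidate stems by stripping affixes, then keep the shortest candidate. The sketch below follows that outline only loosely. The affix lists are hypothetical and far from complete, the minimum stem length is an assumption, and plain affix matching stands in for the grammar used by the original authors.

```python
# Hypothetical affix lists for illustration; a real dialect stemmer needs curated rules.
PREFIXES = ("ال", "و", "ما")
SUFFIXES = ("ات", "ش", "ي")


def candidate_stems(word, min_len=3):
    """Phase 1: forms obtained by stripping at most one prefix and one suffix."""
    starts = [""] + [p for p in PREFIXES if word.startswith(p)]
    ends = [""] + [s for s in SUFFIXES if word.endswith(s)]
    stems = {word}  # the unmodified word is always a candidate
    for p in starts:
        for s in ends:
            stem = word[len(p): len(word) - len(s)]
            if len(stem) >= min_len:
                stems.add(stem)
    return stems


def shortest_stem(word):
    """Phase 2: keep the shortest candidate (ties broken alphabetically)."""
    return min(candidate_stems(word), key=lambda s: (len(s), s))
```

Because the rules apply to the surface form, the same procedure handles borrowed words inflected with Arabic affixes, which is the case a root dictionary would miss.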
The detailed steps may differ from one work to another, but the main parts of pre-processing remain the same. Sources of information are sometimes lost by deleting, for example, emoticons, hashtags and repeated characters, even though they carry information on the strength of a word and the subjectivity of an opinion.
The difficulty in processing an Arabic dialect lies in the automatic construction of the stop word dictionary and in normalization by searching for the roots of words. This is due to the richness of these dialects and the multiple ways of writing words in comments, whether in terms of language, word format, or even the Tatweel and repeated characters sometimes used to express feelings.
The lack of studies in this area also remains an important constraint, and a challenge that motivates further effort to facilitate the pre-processing of these Arabic dialects.

CONCLUSION AND FUTURE WORKS
Sentiment analysis plays an essential role in decision-making in different fields such as politics, digital marketing (product and service evaluation), and the study of social phenomena. Because of its high value in practical applications, there has been an explosive growth in academic research and in industrial applications.
However, there is a remarkable lack of pre-processing work on unstructured languages such as Arabic dialects, Darija being one example, even though these dialects represent a rich source of information, given that they are the most used by the population on non-professional social networks.
This lack may be due to the difficulty of processing these languages, especially for stop word detection and stemming. This situation pushes us to take up the challenge and try to fill this gap in order to exploit a wealth that has not yet been exhausted.
Our future goals are, first, to build on the stop word detection techniques proposed in earlier works, especially the automatic and semi-automatic ones, in order to develop a stop word detection process for Moroccan Darija. Secondly, we propose a data cleaning process that takes into consideration the richness of feeling contained in certain content such as hashtags and emoticons... We also plan to analyse in more detail the various works developed to carry out the stemming steps.