It's All About the 'Bayanat'
AI's Arabic Problem
Introduction
As generative AI capabilities have matured in English, AI companies and their customers are turning their attention to non-English markets, including the Arab world. Many international companies expanding into Arab markets, however, are not aware that Arabic is different from most of the other languages they are used to working with. To obtain good Arabic data (bayanat بيانات), it is imperative to be aware of the linguistic situation in the Arab world. The most important thing you need to know is that there is not just one “Arabic,” but rather several “Arabics.” The language is split between the written language (Standard Arabic), and the spoken language or “dialect”. And the spoken language varies widely from one country to another.
Many Varieties Lead to Data Scarcity
The many varieties of Arabic and the lack of standardization within the national dialects mean that – despite being the sixth most spoken language in the world – Arabic suffers from data scarcity. This data scarcity has serious ramifications for the data that goes into AI models. The data starts out sparse because only 0.5% of the internet is written in Arabic, as opposed to 49.5% in English. Then this tiny percentage in “Arabic” needs to be further divided into varieties of Arabic, since a model trained with one variety of Arabic will not necessarily be useful when faced with a different variety. In other words, a model trained with mostly Standard Arabic will not perform well on Tunisian Arabic. Training data scarcity for the national dialects is further exacerbated by the fact that they don’t have the formal written corpora (news archives, Wikipedia pages, etc.) that form a critical part of most LLMs’ training data.
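To make the arithmetic concrete, here is a minimal sketch using the two internet-share figures quoted above. The function name and the `fraction_in_variety` parameter are hypothetical: this post gives no per-variety breakdown, so any value passed in is a placeholder, not a measurement.

```python
# Internet share figures quoted above (percent of web content).
ARABIC_SHARE_PCT = 0.5
ENGLISH_SHARE_PCT = 49.5

# How many times more English web text exists than Arabic of all varieties.
english_to_arabic_ratio = ENGLISH_SHARE_PCT / ARABIC_SHARE_PCT


def variety_share_pct(arabic_share_pct, fraction_in_variety):
    """Percent of the web written in a single variety of Arabic.

    fraction_in_variety (0.0 to 1.0) is a hypothetical input: the share of
    Arabic web text assumed to be in the variety of interest.
    """
    if not 0.0 <= fraction_in_variety <= 1.0:
        raise ValueError("fraction_in_variety must be between 0 and 1")
    return arabic_share_pct * fraction_in_variety
```

With these figures, `english_to_arabic_ratio` comes out to 99, before the Arabic share is split any further. Whatever fraction you assume for a given dialect only widens that gap.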
Arabic and AI Performance
The effect of this scarcity on model performance is clear. AI models, from OCR to chatbots to translators, perform significantly worse in Arabic than they do in English, or in well-resourced languages like German and French. The performance of many models in the national dialects is even worse. I recently reviewed some output from OpenAI’s ASR system Whisper for Arabic: the output for Standard Arabic was largely understandable, though far from error-free. Much of the output for Saudi Arabic, however, could fairly be described as “word salad.”
A further problem is that a chatbot that is primarily trained on Standard Arabic will chat with users in a socially inappropriate way, leading to conversations that are awkward or unnatural. For example, if I ask an LLM a question in Tunisian Arabic, it would be natural and appropriate for the LLM to respond in Tunisian Arabic. This is not what happens, though, even with sophisticated multilingual LLMs like ChatGPT. Aside from a few words of greeting in Tunisian Arabic, the bot will switch to using almost entirely Standard Arabic, even for very personal topics. This makes the tone of the bot cold and official (like a government statement or news broadcast), rather than friendly and personable. It is no coincidence that, while most writing is in Standard Arabic, advertisers write their ads in dialect: Standard Arabic does not speak to the heart.
The models behave this way because much of the text used for training AI systems is from formal text such as news sources and Wikipedia. In Arabic, these kinds of sources are almost all written in Standard Arabic. This means that the NLP resources available for Standard Arabic are fairly extensive (it is considered a “high resource” language by some measures), while those available for the dialects are much more meager.
The “Arabic-Speaking” Human in the Loop
The difficulties that Arabic poses to AI affect not only the initial model training, but also later “human-in-the-loop” training, like supervised fine-tuning and RLHF. The reason is that Arabic makes it more complex to hire the right humans for your loop. Companies often hire Arabic annotators, transcribers, raters, and content creators with the requirement that they be a “native speaker” of Arabic. Yet this term is not as transparent for Arabic as it is for other languages.
If the target language is Standard Arabic, there are no “native” speakers: nobody grows up speaking Standard Arabic; they have to learn it at school. So the qualification needs to be based on educational level. This became clear on a recent project I worked on, where I reviewed an assessment that had been created to test applicants’ Arabic proficiency. The assessment was written in Standard Arabic by a college-educated, “native speaker” contractor, yet it was full of grammatical mistakes. An appropriate hire for such a task would not only be college educated, but would hold a degree specifically in Arabic language and literature and, preferably, have experience teaching Arabic.
If, on the other hand, the target language is one of the spoken “dialects,” native proficiency will be sufficient qualification, but only for the target dialect. A native speaker of one variety will not necessarily be proficient in another. To give a hypothetical example: if you are hiring for an RLHF project involving Saudi internet videos, and you hire an Egyptian rater who has passed a Standard Arabic assessment, you have no guarantee that that rater will be proficient in Saudi Arabic. In fact, unless the rater has previous experience with or exposure to Saudi Arabic, it is most likely that they will only partially understand the dialect. If the project is in Moroccan Arabic, it’s likely they won’t understand it at all.
Companies who are not aware of this will end up with bad data, and bad models.
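The screening logic described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the `Annotator` fields, the variety labels, and the `is_qualified` function are hypothetical names, and a real hiring process would verify each claim with an assessment in the target variety.

```python
from dataclasses import dataclass, field


@dataclass
class Annotator:
    name: str
    native_dialect: str              # e.g. "Egyptian", "Gulf" (hypothetical labels)
    has_arabic_degree: bool = False  # degree in Arabic language and literature
    familiar_dialects: set = field(default_factory=set)  # verified exposure


def is_qualified(annotator, target_variety):
    """Return True if the annotator fits the target variety of Arabic."""
    if target_variety == "Standard":
        # Nobody is a native speaker of Standard Arabic,
        # so qualification rests on education.
        return annotator.has_arabic_degree
    # For a spoken dialect, require native proficiency in that dialect
    # or verified prior experience with it.
    return (annotator.native_dialect == target_variety
            or target_variety in annotator.familiar_dialects)


# The hypothetical rater from the example above: Egyptian, with an Arabic degree.
rater = Annotator("rater_1", native_dialect="Egyptian", has_arabic_degree=True)
```

Here `is_qualified(rater, "Standard")` and `is_qualified(rater, "Egyptian")` are true, while `is_qualified(rater, "Gulf")` is false: passing a Standard Arabic assessment says nothing about Saudi videos.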
What Makes Arabic Different?
All of this complication is caused by the fact that Arabic is a “diglossic” language (literally, ‘two tongued’). This means that there are multiple forms of the language in use at the same time: the written language (Standard Arabic), and the spoken varieties. Standard Arabic is the formal language used in writing, official communication, and media across the Arab world. It is uniform and standardized; it is also the language of education in every Arab country.
Standard Arabic, however, is not the “mother tongue” of any Arabic speaker. It is learned at school, similar to a foreign language, rather than acquired at home from parents and peers. In fact, it has not been a naturally spoken language for at least a thousand years. Instead, Arabs speak the local vernacular of their country: Iraqi, Tunisian, Egyptian, etc. Although most literate people in the Arab world understand Standard Arabic, it is not common or natural for them to speak it in everyday situations.
Many a foreign student who has studied “Arabic” for several years has found themselves unable to hold even a simple conversation in the Arab world. This is because all the spoken varieties differ significantly from Standard Arabic in vocabulary, pronunciation, and grammar. What’s more, there are also large differences between national spoken varieties.
The national dialects of Arabic are often referred to as “spoken varieties of Arabic”, but they are now quite commonly written as well, thanks to the internet and mobile phones. However, there is no standardized spelling for writing vernacular Arabic, so many common words within a dialect can be spelled in several different ways. This increases the variability of written Arabic even more.
Of course, variation is not a problem for Arabic alone. When we curate datasets for multilingual generative AI projects, we generally don’t source data in “Spanish”, we source “ES-ES,” “ES-MX,” etc., recognizing that the Spanish in Spain is not the same as that in Mexico. But the gulf between different varieties of Arabic is much larger than between Spanish dialects. It is not just accent and vocabulary, but also the core areas of the grammar and the most frequent vocabulary in the language(s). The scale of these differences makes the varieties, in some cases, mutually unintelligible.
What’s the Solution?
The upshot of all this is simply: know what you’re getting into when you start working with Arabic. Make sure that the people who are doing the hiring understand the complexities of the language (feel free to send them this post!) and are hiring the right people for the right kind of Arabic. Which means knowing what kind of Arabic you’re targeting, as well. Working with legal documents? Standard Arabic is what you need. TikTok videos? You’ll need to know where the videos are coming from and what dialect of Arabic is spoken there.
There are several different ways to classify the many varieties of Arabic, but this is what I like to use:
- Standard Arabic (written/formal language – not a mother tongue)
- Gulf Arabic (Saudi Arabia, Kuwait, UAE, Bahrain)
- Iraqi Arabic
- Levantine Arabic (Lebanon, Syria, Jordan, Palestine)
- Egyptian Arabic (Egypt, Sudan)
- Tunisian Arabic (Tunisia, Libya, Eastern Algeria)
- Moroccan Arabic (Morocco, Western Algeria)
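For teams that tag datasets by locale, the classification above can be turned into a simple lookup. The sketch below is an assumption-laden illustration: the `ar-XX` tag format (by analogy with “ES-MX”), the ISO 3166-1 country codes, and the function name are my choices, and the tag parsing is deliberately simplistic. Algeria appears under two groups because the list splits it between east and west.

```python
# Hypothetical lookup table built from the classification above.
# Keys are dialect groups; values are ISO 3166-1 alpha-2 country codes.
DIALECT_GROUPS = {
    "Gulf": ["SA", "KW", "AE", "BH"],
    "Iraqi": ["IQ"],
    "Levantine": ["LB", "SY", "JO", "PS"],
    "Egyptian": ["EG", "SD"],
    "Tunisian": ["TN", "LY", "DZ"],  # DZ here means eastern Algeria
    "Moroccan": ["MA", "DZ"],        # DZ here means western Algeria
}


def dialect_groups_for(locale_tag):
    """Map a tag such as 'ar-EG' (or a bare 'EG') to candidate dialect groups.

    Returns a list because a country code alone can be ambiguous;
    an empty list means the country is not covered by the table.
    """
    # Simplification: treat the last hyphen-separated part as the country.
    country = locale_tag.replace("_", "-").split("-")[-1].upper()
    return [group for group, codes in DIALECT_GROUPS.items() if country in codes]
```

A call like `dialect_groups_for("ar-DZ")` returns two candidates, which is a useful signal that a country code by itself is not enough and someone who knows the language needs to look at the data.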
If you expect to have ongoing work with Arabic, you might want to consider hiring someone on staff who has experience with the language. They will be able to tell one kind of Arabic from another; they’ll also be able to tell you when your app is outputting Arabic that’s disconnected and backwards (which happens surprisingly often!). At Innodata, I make sure that I consult on all our new projects involving Arabic – this has saved our clients from many costly mistakes.
This may all seem very complicated, but creating products that function well in Arabic is well worth the cost. At the end of the day, understanding the nuances of Arabic isn’t just about overcoming data scarcity – it’s about building AI that genuinely speaks to the region.