AI programs often exclude African languages. These researchers have a plan to fix that.

African languages are severely underrepresented in services like Alexa, Siri, and ChatGPT.

There are over 7,000 languages spoken throughout the world, nearly half of which are considered endangered or extinct. Meanwhile, only a comparatively tiny number are supported by natural language processing (NLP) artificial intelligence programs like Siri, Alexa, or ChatGPT. Speakers of African languages are particularly overlooked, and have long faced systemic biases alongside other marginalized communities within the tech industry. To help address inequalities affecting billions of people, a team of researchers in Africa is working to establish a plan of action for developing AI that can support these languages.

The suggestions arrive thanks to members of Masakhane (which roughly translates to “We build together” in isiZulu), a grassroots organization dedicated to advancing NLP research in African languages, “for Africans, by Africans.” As detailed in a new paper published today in Patterns, the team surveyed African language-speaking linguists, writers, editors, software engineers, and business leaders to identify five major themes to consider when developing African NLP tools.

[Related: AI plagiarism detectors falsely flag non-native English speakers.]

Firstly, the team emphasizes Africa as a multilingual society (Masakhane estimates over 2,000 of the world’s languages originate on the continent), and these languages are vital to cultural identities and societal participation. There are over 200 million speakers of Swahili, for example, while 45 million people speak Yoruba.

Secondly, the authors emphasize that developing the proper support for African content creation is vital to expanding access, including tools like digital dictionaries, spell checkers, and African language-supported keyboards.

They also note that multidisciplinary collaborations between linguists and computer scientists are key to designing better tools, and that developers should keep in mind the ethical obligations that come with data collection, curation, and usage.

“It doesn’t make sense to me that there are limited AI tools for African languages. Inclusion and representation in the advancement of language technology is not a patch you put at the end—it’s something you think about up front,” Kathleen Siminyu, the paper’s first author and an AI researcher at Masakhane Foundation, said in a statement on Friday.

[Related: ChatGPT’s accuracy has gotten worse, study shows.]

Some of the team’s other recommendations include additional structural support for content moderation tools to help curtail the spread of misinformation in African languages online, as well as funding for legal cases involving the use of African language data by non-African companies.

“I would love for us to live in a world where Africans can have as good a quality of life and access to information and opportunities as somebody fluent in English, French, Mandarin, or other languages,” Siminyu continues. Going forward, the team hopes to expand their study to include even more participants, and to use their research to help preserve indigenous African languages.

“[W]e feel that these are challenges that can and must be faced,” Patterns' scientific editor Wanying Wang writes in the issue’s accompanying editorial. Wang also hopes additional researchers will submit their own explorations and advancements in non-English NLP.

“This is not limited just to groundbreaking technical NLP advances and solutions but also open to research papers that use these or similar technologies to push language and domain boundaries,” writes Wang.