Despite rapid advances in large language model (LLM) capabilities, a significant "language gap" persists, leaving the vast majority of the world's more than 7,000 languages underserved and creating critical disparities in AI safety. A new preprint by researchers primarily from Cohere and Cohere Labs outlines why this gap exists, how it is widening, and its profound impact on global AI safety, urging policymakers and the AI community to take concerted action.
The paper, titled “The Multilingual Divide and Its Impact on Global AI Safety,” authored by Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Sara Hooker, and a team of collaborators, was released as a preprint on May 28, 2025. It serves as both an analysis and a call to action, highlighting that AI models predominantly reflect English and Western-centric viewpoints, thereby marginalizing other cultural perspectives and introducing safety vulnerabilities for non-English speakers.
Quoting philosopher Ludwig Wittgenstein, "The limits of my language mean the limits of my world," the report underscores how current AI systems, by focusing primarily on a handful of globally dominant languages, are inadvertently limiting the worlds of billions. This is not just an issue of accessibility; it is a critical safety concern that is often overlooked in mainstream AI safety initiatives.
The research delves into the multifaceted reasons behind the current state of multilingual AI. The language gap is not a simple oversight but a result of systemic issues:
This English-centric development means that even the underlying “concept space” of many LLMs is more aligned with English, potentially leading to biases against other cultural perspectives.
The report warns that the language gap risks creating a vicious cycle. Advanced techniques like using synthetic data (AI-generated data for training other AIs) and LLM-based evaluations inherently favor languages already well-supported, further marginalizing low-resource languages. This has several critical consequences:
The paper draws extensively on experiences and lessons from Cohere Labs' Aya initiative, a large-scale participatory machine learning research project involving over 3,000 collaborators from 119 countries. Aya aims to increase access to state-of-the-art AI for all languages. Its inaugural Aya 101 release significantly expanded language coverage in AI, providing a massive multilingual instruction fine-tuning dataset.
Key lessons highlighted from the Aya project include:
The paper articulates three overarching barriers that hinder efforts to close the language disparity and associated AI safety gaps:
To address these challenges and foster a more inclusive and safe AI ecosystem, the researchers offer specific recommendations for policymakers and the broader research community: