New Research Shows How AI Language Models Favor Some Languages Over Others

A new study finds that current AI models favor languages like English and French, making them less effective and more expensive for speakers of underrepresented languages. The researchers propose new methods to make these models fairer and more efficient for everyone.

A study posted to arXiv's cs.CL (Computation and Language) section examines tokenizers for multilingual large language models (LLMs). It finds that current models, which use byte-level Byte-Pair Encoding (BPE) tokenizers, favor high-resource languages like English and French. These tokenizers break text into smaller pieces called tokens, but they do so in a way that makes it harder for a model to understand and generate text in underrepresented languages, especially those from Southeast Asia.
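One way to see where the imbalance starts: a byte-level tokenizer works on the UTF-8 bytes of the text before any merges are learned, and different scripts need different numbers of bytes per character. The short Python sketch below is only an illustration of that starting point, not the study's method. The helper name and the sample sentences are chosen here for the example, and real token counts also depend on the merges a given tokenizer has learned.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character in `text`."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample greetings (chosen for this sketch, not from the study).
samples = {
    "English": "Hello, how are you?",
    "French": "Bonjour, comment ça va ?",
    "Thai": "สวัสดี",
}

for language, sentence in samples.items():
    n_chars = len(sentence)
    n_bytes = len(sentence.encode("utf-8"))
    ratio = utf8_bytes_per_char(sentence)
    print(f"{language}: {n_chars} characters, {n_bytes} bytes, {ratio:.2f} bytes per character")
```

Plain ASCII text uses one byte per character, while Thai characters use three bytes each in UTF-8. A byte-level tokenizer therefore begins with a longer sequence for Thai, and it only shortens that sequence if its training data contained enough Thai to learn useful merges.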

This bias isn't just a fairness problem. It also makes these models more expensive to run for speakers of underrepresented languages. The study presents the first systematic comparison of equitable tokenizers on a unified benchmark. Its results suggest that new, more equitable tokenizers could help close this gap, so that AI language models work better and cost less regardless of the language a user speaks.
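To make the cost point concrete, here is a minimal Python sketch that assumes a service billing per token, which is a common pricing model but not something the study specifies. The function names, the token counts, and the price are all hypothetical values picked for illustration. They are not measurements from the study.

```python
def token_ratio(tokens_target: int, tokens_reference: int) -> float:
    """How many times more tokens the target language needs for the same content."""
    if tokens_reference <= 0:
        raise ValueError("tokens_reference must be positive")
    return tokens_target / tokens_reference


def request_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a request under simple per-token pricing (an assumption of this sketch)."""
    return num_tokens / 1000 * price_per_1k_tokens


# Hypothetical numbers: the same message tokenized in two languages.
english_tokens = 100   # assumed token count for the English version
other_tokens = 300     # assumed token count for the same message in another language
price = 0.50           # assumed price per 1,000 tokens, in arbitrary currency units

print("Token ratio:", token_ratio(other_tokens, english_tokens))
print("English cost:", request_cost(english_tokens, price))
print("Other-language cost:", request_cost(other_tokens, price))
```

Under per-token billing, the cost of a request scales directly with the token ratio, so a tokenizer that splits a language into more pieces makes every request in that language proportionally more expensive. Longer token sequences also use up more of a model's context window for the same amount of text.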

If you're curious about how this affects you, try using a multilingual AI tool like Google Translate or DeepL. Notice how some languages might feel more natural or accurate than others. This study points to one reason that can happen and how it could be improved in the future.