New AI Research Aims to Fix Language Barriers in Factual Knowledge

Researchers created a multilingual dataset to improve AI's ability to share facts across languages. This could make AI assistants more reliable worldwide.

Researchers have introduced PolyFact, a dataset of 100,000 facts grounded in Wikidata and covering 12 typologically diverse languages. AI models trained mostly on English often give different answers to the same factual question depending on the language it is asked in, a problem known as cross-lingual factual inconsistency. PolyFact is meant to address this by giving models training material for expressing the same knowledge reliably in multiple languages.

The study compares three training approaches: light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO). The goal is to determine which method best reduces factual inconsistencies across languages.
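For readers curious about what "group relative" means in GRPO, the core idea is that the model samples several answers to the same prompt and each answer is scored against the average of its own group. The sketch below shows only that normalization step. The function name, the 1.0/0.0 reward scheme, and the use of the population standard deviation are assumptions made for illustration, not details taken from the study.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    An answer that beats the group average gets a positive advantage,
    one that falls below it gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one prompt:
# 1.0 = matches the reference fact, 0.0 = does not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

With these example rewards, the two correct answers receive an advantage close to +1 and the two incorrect ones close to -1. If every answer in a group earns the same reward, all advantages are zero and that group gives the model nothing to learn from.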

This research matters because it could make AI assistants like Siri or Google Assistant more useful for non-English speakers. Imagine asking your AI assistant a question in Spanish and getting the same accurate answer you'd get in English. This could bridge language gaps and make AI more accessible globally.

To see the problem for yourself, ask a multilingual AI assistant such as Google Assistant the same factual question in two or more languages you know, then compare the answers. Any differences you notice are the kind of inconsistency this research aims to fix.
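If you collect the answers by hand, a few lines of Python can turn the comparison into a rough score. This is a minimal sketch, not the study's evaluation method: the function names and the sample answers are made up, and it assumes the correct answer is spelled the same way in every language, which holds for many proper names in Latin-script languages but not in general.

```python
import unicodedata
from collections import Counter

def normalize(answer):
    """Fold case, Unicode form, and surrounding whitespace before comparing."""
    return unicodedata.normalize("NFKC", answer).casefold().strip()

def consistency(answers_by_language):
    """Share of languages whose answer agrees with the most common answer."""
    normalized = [normalize(a) for a in answers_by_language.values()]
    _, count = Counter(normalized).most_common(1)[0]
    return count / len(normalized)

# Hypothetical answers to "What is the capital of Australia?"
answers = {"en": "Canberra", "es": "Canberra", "de": "Sydney", "fr": "canberra "}
score = consistency(answers)  # 3 of 4 languages agree -> 0.75
```

A score of 1.0 means every language gave the same answer. For facts whose names change across languages (for example, country names), a fairer comparison would first map each answer to a shared identifier such as a Wikidata ID, which this sketch does not attempt.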