In an earlier post I had said that I was planning to elaborate on some of the ways that those who have developed artificial intelligence models like ChatGPT, Claude, and so on may not understand their own creations. This is that post.
The Sapir-Whorf hypothesis is simple to state but far-reaching in its implications: language affects cognition. It comes in two forms: the strong form, that language constrains thought, and the weak form, that language makes thinking in certain ways easier. There is substantial evidence that language influences cognition, particularly in perception, categorization, and memory, and it has been widely accepted for nearly 60 years. It is the weak form to which I will refer in the balance of this post.
The key point is that it applies to generative artificial intelligence, too. Algorithms designed by native speakers of English and trained using content written in the English language will differ from algorithms designed by Mandarin, Yoruba, or Arabic speakers and trained using Mandarin, Yoruba, or Arabic content in ways we can’t even predict. Their output can still be grammatically correct and intelligible, but it will feel “off”. It won’t uniformly make intuitive sense. Although English and German are closely related linguistically, models created by native speakers of German and optimized for German users will be subtly different from models created by native speakers of English and optimized for English-speaking users.
That won’t be a dealbreaker but it will impair their commercial viability, particularly in certain fields like medicine or law. I strongly suspect that this is an aspect that the developers of gAI models are not sensitive to. There is no cognitive layer on which the linguistic model resides. It’s the other way around. It also may not be something that American managers are sensitive to but that doesn’t mean that it doesn’t apply. It just means that models developed by non-English speakers will not be as useful to native speakers of English as they might be.
The precise linguistic mechanisms are beyond the scope of this post; the important point is simply that different languages encode meaning, context, categorization, and association differently, and models trained on those languages will necessarily reflect those differences. A model can produce flawless English and communicate clearly but still not feel fully native in its assumptions or reasoning patterns. That has serious implications for commercial utility.
That’s why, for example, I don’t worry too much about China’s competing with the U.S. in gAI. I have every confidence in the ability of native speakers of Mandarin to develop their own generative AI models. Models trained primarily on English-language content and optimized for English-speaking users will differ from models trained primarily on Mandarin-language content and optimized for Mandarin-speaking users. Is there enough Mandarin-language content on which to train them? I don’t know. Indeed, I doubt that anyone knows. And if the models are trained, again, using English-language content, the results will be subtly different. Whether these differences arise primarily from language structure itself or from the cultures embedded in the training corpora may not matter operationally; the resulting models will still exhibit different intuitions and priorities.
The Chinese are absolutely capable of developing their own models running on their own graphics processing units (GPUs). At the present state of technology those GPUs will require about twice as much electricity as GPUs built in Taiwan or the U.S.
The same is true of developing generative AI models in other languages as well, e.g., Farsi, Arabic, or Tagalog. But the Sapir-Whorf hypothesis will continue to apply.







One mistake is to think these models’ “training data” is limited in the same way that humans are limited.
The SOTA models are trained on the corpus of the whole internet, which includes all languages, and also on all the text in all publicly available books.
Those contain more textual data for non-English languages than a human could ever consume.
The training data shortfalls are of a completely different kind, like the shortage of visual and audio data; those are left to completely different models/products.