It Applies to gAI, Too

Dave Schuler May 28, 2026

In an earlier post I had said that I was planning to elaborate on some of the ways that those who have developed artificial intelligence models like ChatGPT, Claude, and so on may not understand their own creations. This is that post.

The Sapir-Whorf hypothesis is simple to state but far-reaching in its implications: language affects cognition. It comes in two forms: the strong form, that language constrains thought, and the weak form, that language makes thinking in certain ways easier. There is substantial evidence that language influences cognition, particularly in perception, categorization, and memory, and that influence has been widely accepted for nearly 60 years. It is the weak form to which I will refer in the balance of this post.

The key point is that it applies to generative artificial intelligence, too. Algorithms designed by native speakers of English and trained using content written in the English language will differ from algorithms designed by Mandarin, Yoruba, or Arabic speakers and trained using Mandarin, Yoruba, or Arabic content in ways we can’t even predict. They can still be grammatically correct and intelligible, but they’ll feel “off”. They won’t uniformly make intuitive sense. Although English and German are closely related linguistically, models created by native speakers of German and optimized for German users will be subtly different from models created by native speakers of English and optimized for English-speaking users.

That won’t be a dealbreaker but it will impair their commercial viability, particularly in certain fields like medicine or law. I strongly suspect that this is an aspect that the developers of gAI models are not sensitive to. There is no cognitive layer on which the linguistic model resides. It’s the other way around. It also may not be something that American managers are sensitive to but that doesn’t mean that it doesn’t apply. It just means that models developed by non-English speakers will not be as useful to native speakers of English as they might be.

The precise linguistic mechanisms are beyond the scope of this post; the important point is simply that different languages encode meaning, context, categorization, and association differently, and models trained on those languages will necessarily reflect those differences. A model can produce flawless English and communicate clearly but still not feel fully native in its assumptions or reasoning patterns. That has serious implications for commercial utility.

That’s why, for example, I don’t worry too much about China’s competing with the U.S. in gAI. I have every confidence in the ability of native speakers of Mandarin to develop their own generative AI models. Models trained primarily on English-language content and optimized for English-speaking users will differ from models trained primarily on Mandarin-language content and optimized for Mandarin-speaking users. Is there enough Mandarin-language content on which to train them? I don’t know. Indeed, I doubt that anyone knows. And if the models are trained, again, using English-language content, the results will be subtly different. Whether these differences arise primarily from language structure itself or from the cultures embedded in the training corpora may not matter operationally; the resulting models will still exhibit different intuitions and priorities.

The Chinese are absolutely capable of developing their own models running on their own graphics processing units (GPUs). At the present state of technology, those GPUs will require about twice as much electricity as GPUs built in Taiwan or the U.S.

The same is true of developing generative AI models in other languages as well, such as Farsi, Arabic, and Tagalog. But the Sapir-Whorf hypothesis will continue to apply.

6 comments

CuriousOnlooker

One mistake is to think these models’ “training data” is limited in the same way that humans are limited.

The SOTA models are trained on the corpus of the whole internet — which includes all languages. And also all the text in all books available publicly.

Those contain more textual data for non-English languages than a human could ever consume.

The training data shortfalls are completely different, like a shortage of visual and audio data; those are left to completely different models/products.

PD Shaw

Brendan Foody (Mercor) has indicated that law is one of the more difficult areas to train AI because it involves a lot of “taste,” by which I believe he means subjective elements that don’t necessarily fit a rubric. He has also said that there are areas of law in which the right way to do something is not written down or codified, but is in the heads of experts. He seems to be on a voyage of discovery, but confident that hiring experts will eventually resolve the issues. Perhaps it will, if it’s truly experts that have the knowledge, instead of practitioners. Experts sound like the view from the top down, as opposed to lawyers working at the ground level.

Dave Schuler

CuriousOnlooker:

Shorter: use agentic AI. The more generative AI is trained on multi-lingual slop, the less intuitively correct its results will become. That will result in a progressive loss of utility as the amount of slop increases, which it inevitably will.

Drew

Just musing. As the difference between a language and dialects is somewhat subjective, I wonder how sensitive the issue you cite is to dialects. Down here they might think AI programs developed in NY are as crooked as a dog’s hind leg...

Dave Schuler

It depends on what is meant by “dialect”. In China and the Arab world what are referred to as “dialects” are actually different languages and may have different cognitive implications. In the United States, for example, the Southern and New England dialects are still mutually comprehensible; they are actual dialects and have the same cognitive effects.

Drew

That’s why I used the word “subjective.”

I understand your point, but I actually read about this whole issue a few weeks ago. The authors were not as definitive as you. I am not a linguist or an AI designer, so... musing.