Average Russian Proves Smarter Than AI: SFU Scientists Conduct Study Using Archive of "What? Where? When?" Game

LLMs Erred on Questions That Are Trivial for Russians

Researchers from Southern Federal University (SFU) have proposed an unusual way to assess the state of artificial intelligence: having it answer questions from the TV quizzes "What? Where? When?" and "Own Game." It turned out that even the most advanced large language models (LLMs) handle such tasks worse than the average Russian.

The main problem of modern LLMs, including ChatGPT and Llama 3, is the lack of high-quality data in Russian. As Bogdan Protsenko, project lead at SFU, explained, foreign models are usually trained on English-language datasets, while Russian-language content is either translated or present in smaller volumes.

For example, there is noticeably less Russian than English in the pre-training data of all foreign models, so a model writes and "thinks" more intelligently and competently if you prompt it in English and ask it to answer in English. Benchmarks, the "rulers" used to measure the quality of models and their performance in different languages, are usually just translations from one language, most often English, into the others. Such translated benchmarks do not reflect a model's quality in a real language situation.
Bogdan Protsenko, lead investigator of the "Frontier Laboratory of X-ray Spectral Nanometrology" project at the SFU Center for Science-Intensive Instrumentation

Questions from "What? Where? When?" require not only erudition but also an understanding of cultural context, wordplay, and logical connections. An LLM with 405 billion parameters, capable of working in dozens of natural and programming languages and across fields of knowledge from quantum mechanics to medicine, nevertheless made mistakes on questions about the composer Vladimir Shainsky and the Tsar Cannon, topics obvious to Russian speakers.

Mikhail Levandovsky, a four-time world champion in "What? Where? When?", noted that the game's main feature is its variability. In the game's early years, the key to success was the ability to recognize "phenomena": abstract images and social patterns. For modern artificial intelligence systems, this still poses a serious problem.

The scientists experimented with different methods of generating answers, including chain-of-thought reasoning and the model's internal self-critique. It turned out that an approach mimicking discussion within a team of experts improves accuracy but sometimes suppresses the AI's creativity.
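To make the scheme concrete, below is a minimal sketch, not the authors' published code, of how such an "expert team" pipeline could be assembled from any chat LLM: several independent chain-of-thought drafts followed by a critical pass that picks the final answer. The openai client, the model name, the prompts, and the example question are all illustrative assumptions.

```python
# Minimal sketch of an "expert team" answering pipeline: several independent
# chain-of-thought drafts, then a self-critique pass that picks one answer.
# NOT the SFU authors' code; model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, temperature: float) -> str:
    """One chat-completion call; any chat LLM could stand in here."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content


def answer_quiz_question(question: str, n_experts: int = 3) -> str:
    # Step 1: each "expert" reasons step by step (chain of thought).
    drafts = [
        ask(
            "Question from the quiz 'What? Where? When?':\n"
            f"{question}\n"
            "Think step by step about wordplay and cultural context, "
            "then state your best answer on the last line.",
            temperature=0.8,  # higher temperature for diverse drafts
        )
        for _ in range(n_experts)
    ]
    # Step 2: a "captain" pass critiques the drafts and picks the final
    # answer, mimicking the table discussion before the team responds.
    joined = "\n\n---\n\n".join(drafts)
    return ask(
        "Here are several draft answers from your teammates:\n"
        f"{joined}\n\n"
        "Point out the weaknesses of each draft, then state the single "
        "most defensible final answer.",
        temperature=0.2,  # low temperature for the final, critical pass
    )


# Illustrative question, not taken from the game archive:
print(answer_quiz_question(
    "Which composer wrote the songs for the Soviet Cheburashka cartoons?"
))
```

The two-stage design reflects the trade-off the researchers observed: sampling diverse drafts widens the search, while the critical pass filters them, which raises accuracy at some cost to originality.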

It will certainly be easier for artificial intelligence to answer questions from "Own Game", since they usually test a player's erudition and personal knowledge, whereas questions from "What? Where? When?" are more about a team of experts' ability to think and guess. To answer an average ChGK question (ChGK is the Russian acronym for "What? Where? When?"), a Russian usually needs no special knowledge beyond the school curriculum and general culture; the trouble is that the questions are often "wrapped" so that only a few understand what is actually being asked. If the community of ChGK question authors learns that AI can answer their questions easily, it will motivate them to twist new questions so that AI has no chance, while the difficulty for human experts stays the same.
Alexey Paevsky, science journalist, science popularizer, and lecturer; participant in "Own Game" and author of questions for "What? Where? When?"

Artificial intelligence cannot yet match humans in the ability to generate new ideas and find non-standard solutions. Although AI can answer questions that already have a correct answer, it is still incapable of creative thinking and of creating something new. Until large language models are trained on enough high-quality Russian-language data, they will remain inferior to us even in quizzes.
