Russian Software Engineers Teach Robot Arm to "Think" with AI

It can already distinguish objects by color and sort them into a designated area

Scientists from MIPT, the Artificial Intelligence Institute AIRI, and the Federal Research Center "Informatics and Management" of the Russian Academy of Sciences have developed a method for controlling a robotic system that performs its actions based on text instructions and visual information.

The robot arm has learned to sort cubes on the table by color and move them to a specified area

Further development of this approach could enable robots to autonomously perform complex multi-step operations without human intervention. No one in the world has fully achieved this yet, but prototypes of such robots are being developed both in Russia and abroad.

According to the MIPT press service, the method is based on a bimodal transformer architecture. The model was initially trained on a number of skills: text translation, answering questions about an image, image generation, and several others.

After a new modality for robot control was added, the robotic system was able to orient itself in an unfamiliar environment and independently determine a sequence of actions to solve the task at hand. A paper describing the method was published in the international journal IEEE Access.

MIPT notes that the test platform in the study was a robot arm with six degrees of freedom. It had to sort objects on the table by color and collect them in a designated area. The arm chose each action based on text instructions and data from video cameras.

Video: MIPT press service

According to the developers, the training algorithm for the manipulator works on a principle that "resembles the GPT model", except that instead of text the model produces a sequence of actions for the robot. After each action, the computer controlling the robot arm receives feedback from the video cameras and plans the next action.
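The loop described above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`plan_next_action`, `observe`, and the toy cube state stand in for the transformer and the camera pipeline), not the authors' implementation: the model emits one action at a time, and camera feedback is folded back into the context before the next step.

```python
# Minimal sketch of the perceive-plan-act loop: plan an action from the
# current context, execute it, fold camera feedback back into the context.
from dataclasses import dataclass

@dataclass
class Action:
    name: str    # e.g. "move_to_area" or "stop"
    target: str  # which cube the action applies to

def plan_next_action(context):
    """Stand-in for the transformer: pick the first unsorted cube."""
    for cube, placed in context["cubes"].items():
        if not placed:
            return Action("move_to_area", cube)
    return Action("stop", "")

def observe(context, action):
    """Stand-in for camera feedback: mark the moved cube as placed."""
    if action.name == "move_to_area":
        context["cubes"][action.target] = True
    return context

def control_loop(instruction):
    # Initial observation: three cubes on the table, none yet sorted.
    context = {"instruction": instruction,
               "cubes": {"red": False, "green": False, "blue": False}}
    history = []
    while True:
        action = plan_next_action(context)  # plan from current context
        if action.name == "stop":
            break
        history.append(action)
        context = observe(context, action)  # feedback closes the loop
    return history

actions = control_loop("sort the cubes by color")
```

The point of the structure is that planning never runs "open loop": every step is conditioned on the latest observation, so a failed grasp would simply leave the cube unplaced and be retried.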

The novelty of the work is that we used ready-made language models to train the robot: algorithms that help translate natural speech into code understandable to the control system. These are neural networks pre-trained on large amounts of text data. In our case, the multimodal RozumFormer model was used. Unlike others, it can generate a response both to text queries and to queries given in the form of images.
Alexey Kovalev, co-author of the work, junior researcher at the Federal Research Center "Informatics and Management" of the Russian Academy of Sciences, and researcher at AIRI
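A common way for a single model to handle both text and image queries, as described in the quote, is to map both modalities into one token sequence. The sketch below is purely illustrative (the tokenizers, the `<txt>`/`<img>` markers, and the crude patch averaging are assumptions, not RozumFormer's actual scheme): words become vocabulary ids, image patches become "visual tokens", and both are concatenated for the transformer to consume.

```python
# Illustrative sketch: fold a text query and an image into one token sequence.
def text_tokens(text, vocab):
    # Assign each new word the next free id in the vocabulary.
    return [vocab.setdefault(w, len(vocab)) for w in text.lower().split()]

def image_tokens(pixels, patch=2):
    # Crude "patch embedding": average each patch x patch block of the grid.
    tokens = []
    for r in range(0, len(pixels), patch):
        for c in range(0, len(pixels[0]), patch):
            block = [pixels[i][j]
                     for i in range(r, r + patch)
                     for j in range(c, c + patch)]
            tokens.append(sum(block) // len(block))
    return tokens

def build_sequence(text, pixels):
    # Modality markers tell the model where text ends and the image begins.
    vocab = {}
    return (["<txt>"] + text_tokens(text, vocab)
            + ["<img>"] + image_tokens(pixels))

seq = build_sequence("put the red cube in the box",
                     [[0, 0, 9, 9],
                      [0, 0, 9, 9],
                      [5, 5, 1, 1],
                      [5, 5, 1, 1]])
```

In a real vision-language model the patch step is a learned embedding rather than an average, but the principle is the same: once everything is a token, one transformer can attend over both modalities.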

RozumFormer was then fine-tuned so that it could "understand" the colors of the cubes, the distances to them, and other parameters of its environment, after which it began to control the manipulator. Step-by-step adaptation prepared the neural network so that, receiving feedback from the video cameras, it could independently plan further actions and solve the tasks assigned to it using the learned algorithms. And it succeeded.
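One common form such adaptation takes, sketched below under stated assumptions (the frozen weights, the linear head, and the toy data are all hypothetical, not the paper's training setup), is to keep a pre-trained backbone fixed and train only a small new component for the robot modality:

```python
# Hedged sketch of adaptation: the pre-trained "backbone" is frozen and
# only a small new head is trained for the robot-control task.

# Frozen backbone weights: a fixed feature extractor, never updated.
FROZEN_W = [0.5, -0.2, 0.8]

def backbone(obs):
    # Maps a raw observation to features using the frozen weights.
    return [x * w for x, w in zip(obs, FROZEN_W)]

def train_head(data, lr=1.0, epochs=200):
    # Only the new head's weights change during adaptation.
    head = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for obs, target in data:
            feats = backbone(obs)
            pred = sum(f * h for f, h in zip(feats, head))
            err = pred - target
            # Gradient step on the head alone (squared-error loss).
            head = [h - lr * err * f for h, f in zip(head, feats)]
    return head

# Toy data: one-hot observations mapped to a scalar "action parameter".
data = [([1, 0, 0], 1.0), ([0, 1, 0], -1.0), ([0, 0, 1], 0.5)]
head = train_head(data)
```

Freezing the backbone preserves the skills learned during pre-training (translation, visual question answering, and so on) while the new head learns the robot-specific mapping, which is the general idea behind adapting a pre-trained model to a new modality.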

The scientists now face the task of teaching the model to remember longer chains of actions. In the future, this will help robots cope without a human in situations that require a non-standard approach and an instant assessment of the situation. For a robot assistant, the simplest examples are washing dishes, tidying up, and sorting objects by room and by purpose during cleaning.