Scientists from MIPT, the AIRI Institute of Artificial Intelligence, and the Federal Research Center "Informatics and Management" of the Russian Academy of Sciences have developed a method for controlling a robotic system that performs its actions based on text instructions and visual information.
Further development of this approach could enable robots that autonomously execute complex multi-step operations without human intervention. No one in the world has yet fully achieved this, but prototypes of such robots are under development both in Russia and abroad.
According to the MIPT press service, the method is based on a bimodal transformer architecture. The model was initially trained on a range of tasks: text translation, answering questions about images, image generation, and others.
When a new modality for robot control was added, the robotic system was able to operate in unfamiliar environments and independently determine the sequence of actions needed to solve a task. A scientific paper describing the method was published in the international journal IEEE Access.
MIPT notes that the testbed in the study was a robotic arm with six degrees of freedom. Its task was to sort objects on a table by color and assemble them in a designated area. The arm chose its actions based on text instructions and data from video cameras.
According to the developers, the training algorithm for the manipulator "resembles the GPT model," except that instead of text the model outputs a sequence of actions for the robot. After each action, the controller driving the robotic arm receives feedback from the video cameras and then plans the next action.
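The control scheme described above can be illustrated with a minimal Python sketch: the model picks one action at a time, re-observing the scene through the cameras after every step. All names here (`PolicyModel`, `DISCRETE_ACTIONS`, the placeholder action-selection logic) are invented for illustration and are not the authors' actual API or model.

```python
# Minimal sketch of a GPT-style action loop with camera feedback.
# The policy below is a stand-in: a real system would run a transformer
# forward pass over the tokenized instruction and observation history.
from dataclasses import dataclass, field

# Hypothetical discrete action set for a pick-and-place arm.
DISCRETE_ACTIONS = ["move_left", "move_right", "lower", "grasp", "lift", "release"]

@dataclass
class PolicyModel:
    """Stand-in for a model mapping (instruction, observations) to the next action."""
    history: list = field(default_factory=list)

    def next_action(self, instruction: str, observation: str) -> str:
        # Placeholder logic: cycle through the action set deterministically.
        action = DISCRETE_ACTIONS[len(self.history) % len(DISCRETE_ACTIONS)]
        self.history.append((observation, action))
        return action

def control_loop(model, instruction, get_camera_frame, max_steps=10):
    """Autoregressively choose one action at a time, re-observing after each."""
    executed = []
    for _ in range(max_steps):
        frame = get_camera_frame()      # feedback from the video cameras
        action = model.next_action(instruction, frame)
        executed.append(action)         # here the action would go to the arm
        if action == "release":         # task-specific stopping condition
            break
    return executed

actions = control_loop(PolicyModel(), "sort cubes by color", lambda: "frame")
print(actions)  # ['move_left', 'move_right', 'lower', 'grasp', 'lift', 'release']
```

The key point the sketch captures is that planning is interleaved with perception: each new camera frame enters the model's context before the next action is chosen, rather than the whole action sequence being emitted in one shot.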
The novelty of the work is that we used ready-made language models to train the robot, that is, algorithms that translate natural speech into code understandable to the control system. These are neural networks pre-trained on large amounts of text data. In our case, we used the multimodal model RozumFormer, which, unlike others, can generate a response both to text queries and to queries made in the form of images.
RozumFormer was then fine-tuned to "understand" the colors of the cubes, the distances to them, and other parameters of its surroundings, after which it began to control the manipulator. This step-by-step adaptation prepared the neural network so that, receiving feedback from the video cameras, it could independently plan further actions and solve the tasks set before it using what it had learned. And it succeeded.
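One common way such a multimodal model handles both text and images is to place tokens from both modalities into a single input sequence, from which the model then predicts action tokens. The following sketch shows that idea only in outline; the vocabularies, token ids, and id ranges are invented for illustration and do not come from RozumFormer.

```python
# Illustrative sketch: text and image inputs sharing one token sequence.
# All vocabularies and ids below are hypothetical.
TEXT_VOCAB = {"<bos>": 0, "sort": 1, "red": 2, "cube": 3}
IMAGE_ID_OFFSET = 50        # image tokens get their own id range
ACTION_VOCAB = {"pick": 100, "place": 101}  # outputs the model would predict

def encode_text(words):
    """Map instruction words to text-token ids."""
    return [TEXT_VOCAB[w] for w in words]

def encode_image(patch_ids):
    """Map quantized image-patch ids into a disjoint id range,
    so the model can tell the modalities apart."""
    return [IMAGE_ID_OFFSET + p for p in patch_ids]

def build_sequence(words, patch_ids):
    """One flat sequence: [text tokens | image tokens]. A transformer
    conditioned on this sequence would then emit action tokens
    (e.g. ids from ACTION_VOCAB) autoregressively."""
    return encode_text(words) + encode_image(patch_ids)

seq = build_sequence(["<bos>", "sort", "red", "cube"], [0, 1, 2, 3])
print(seq)  # [0, 1, 2, 3, 50, 51, 52, 53]
```

Keeping actions in the same token space as text and images is what makes the "GPT-like, but outputs actions instead of text" description possible: fine-tuning only has to teach the model a new output vocabulary, not a new architecture.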
The scientists now face the task of teaching the model to remember longer chains of actions. In the future, this will help robots cope without a human with tasks that require a non-standard approach and an instant assessment of the situation. In the simplest case, for a robot assistant this means washing dishes, cleaning, and sorting items by room and by purpose during cleaning.