Scientists from MIPT, together with international colleagues, have presented a new stereo vision algorithm, Un-ViTAStereo, which lets robots and drones "see" the world in three dimensions without expensive lidars or labor-intensive manual annotation.
Simply put, the system learns to judge the distance to objects the way a human does, from two images. But it does so more accurately and more robustly, especially in difficult conditions.
The development is based on the Depth Anything V2 model, which analyzes a single image and estimates scene depth from indirect cues such as shadows, perspective, and occlusions. The algorithm then keeps only the stereo data that agrees with these "hints" from the teacher model and builds an accurate distance map from it.
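The teacher-guided filtering step can be sketched as follows. This is an illustrative reconstruction, not the authors' published code: the function name, the relative tolerance of 15%, and the agreement criterion are all assumptions made for the example.

```python
import numpy as np

def filter_with_teacher(stereo_depth, teacher_depth, valid_mask, rel_tol=0.15):
    """Keep only stereo depth estimates that agree with the monocular
    teacher's prediction within a relative tolerance.

    NOTE: the 15% tolerance and the agreement test are hypothetical;
    the paper's actual criterion may differ.
    """
    agreement = np.abs(stereo_depth - teacher_depth) <= rel_tol * teacher_depth
    return valid_mask & agreement

# Toy example: three pixels; the middle stereo estimate (4 m) disagrees
# with the teacher's prediction (8 m) and is discarded.
stereo = np.array([10.0, 4.0, 25.0])
teacher = np.array([10.5, 8.0, 24.0])
mask = filter_with_teacher(stereo, teacher, np.array([True, True, True]))
# mask -> [True, False, True]
```

Only the pixels that survive this check are then used to construct the final distance map.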
The system works in three steps: first it checks each pixel, then it fills in problem areas from neighboring points, and finally it smooths the result to obtain a complete picture.
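The three steps above can be sketched in simplified form on a 1-D row of depth values. The concrete operators here (nearest-neighbor filling, a 3-tap moving average) are placeholders chosen for clarity; the actual algorithm's operators are not specified in the article.

```python
import numpy as np

def complete_and_smooth(depth, valid):
    """Illustrative three-step cleanup: check pixels, fill invalid ones
    from neighbors, then smooth. Not the paper's exact procedure."""
    out = depth.copy()
    # Step 1: the per-pixel check is assumed to have produced `valid`,
    # a boolean mask marking which depth values passed.
    # Step 2: fill each invalid pixel from its nearest valid neighbors.
    for i in np.where(~valid)[0]:
        left = np.where(valid[:i])[0]
        right = np.where(valid[i + 1:])[0]
        candidates = []
        if left.size:
            candidates.append(out[left[-1]])
        if right.size:
            candidates.append(out[i + 1 + right[0]])
        out[i] = np.mean(candidates)
    # Step 3: smooth with a simple 3-tap moving average.
    padded = np.pad(out, 1, mode='edge')
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

# A hole at index 1 is filled from its neighbors (mean of 1.0 and 3.0),
# then the whole row is smoothed.
result = complete_and_smooth(np.array([1.0, 0.0, 3.0]),
                             np.array([True, False, True]))
```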
The result is a noticeable increase in accuracy. In tests on the KITTI 2015 benchmark (a standard autonomous driving dataset), the proportion of gross errors dropped to 5%, a 23% reduction in critical distance-estimation errors.
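The "proportion of gross errors" on KITTI 2015 is usually the D1 outlier rate: a pixel counts as a gross error when its disparity is off by more than 3 pixels and more than 5% of the true value. Assuming that is the metric meant here, it can be computed like this:

```python
import numpy as np

def d1_error(pred_disp, gt_disp):
    """KITTI-style outlier rate: fraction of pixels whose disparity
    error exceeds both 3 px and 5% of the ground-truth disparity."""
    err = np.abs(pred_disp - gt_disp)
    outlier = (err > 3.0) & (err > 0.05 * gt_disp)
    return outlier.mean()

# Four pixels: only the third (error 4 px on a true disparity of 10 px,
# i.e. 40%) trips both thresholds, so the rate is 1/4.
gt = np.array([100.0, 40.0, 10.0, 60.0])
pred = np.array([96.0, 41.0, 14.0, 60.5])
rate = d1_error(pred, gt)
# rate -> 0.25
```

Note that the first pixel's 4 px error is not an outlier: on a 100 px disparity it is only 4%, below the 5% relative threshold.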
Traditional stereo vision systems often "go blind" in complex scenes, for example on uniform surfaces, in fog, or among dense foliage. The new algorithm partially solves this problem and remains cheaper to deploy, since it does not require a lidar, one of the most expensive components of autonomous transport systems.
The developers note that this is only a first step: in the future they plan to make the algorithm self-learning and to improve its accuracy further using lidar data.