High-quality data is the fuel for AI algorithms. Without a continuous flow of labeled data, bottlenecks can occur, and the algorithm slowly degrades, adding risk to the system.
This is why labeled data is so important for companies like Zoox, Cruise and Waymo, which use it to train the machine learning models behind the autonomous vehicles they develop and deploy. That need led to the founding of Scale AI, a startup that uses software and people to process and label image, lidar and map data for companies that develop machine learning algorithms. Companies working on autonomous vehicle technology make up a large part of Scale's customer base, although its platform is also used by Airbnb, Pinterest and OpenAI, among others.
The COVID-19 pandemic has slowed or even stopped this flow of data, as AV companies have suspended testing on public roads, the means of collecting billions of images. Scale hopes to turn the tap back on, and for free.
This week the company, in collaboration with lidar manufacturer Hesai, released an open source dataset called PandaSet that can be used to train machine learning models for autonomous driving. The dataset, which is free and licensed for academic and commercial use, contains data collected using Hesai's forward-looking PandarGT lidar, which has image-like resolution, and its mechanical spinning lidar known as Pandar64. The data was collected while driving in urban areas of San Francisco and Silicon Valley before officials in the region issued stay-at-home orders.
"AI and machine learning are incredible technologies with incredible impact, but also a big pain in the ass," said Scale CEO and co-founder Alexandr Wang in a recent interview with theinformationsuperhighway. "Machine learning is definitely a garbage in, garbage out kind of framework: you really need high-quality data to support these algorithms. That is why we built Scale, and it is why we are using this data set today to drive the industry forward with an open source perspective."
The aim of this lidar dataset was to provide free access to a dense, content-rich dataset, which Wang said was achieved by using two types of lidar in complex urban environments filled with cars, bicycles, traffic lights and pedestrians.
"The Zooxes and the Cruises of the world will often talk about how battle-tested their systems are in these dense urban environments," said Wang. "We really wanted to make that accessible to the whole community."
The data set contains more than 48,000 camera images and 16,000 lidar sweeps, which according to the company amounts to more than 100 scenes of 8 seconds each. It also includes 28 annotation classes for each scene and 37 semantic segmentation labels for most scenes. Conventional cuboid labeling, the small boxes placed around a bicycle or a car, for example, cannot adequately identify all of the lidar data. Scale therefore uses a point cloud segmentation tool to precisely annotate complex objects such as rain.
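To make the difference concrete, here is a minimal toy sketch in Python, not Scale's tooling and not the PandaSet format. The coordinates, the class names and the `Cuboid` helper are all made up for illustration. It labels a handful of lidar points with a single axis-aligned box and compares the result with per-point labels.

```python
# Toy comparison of cuboid labels and per-point segmentation labels.
# All coordinates, class names and helpers are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Cuboid:
    """Axis-aligned box described by its min and max corners (x, y, z)."""
    lo: tuple
    hi: tuple
    label: str

    def contains(self, point):
        return all(a <= v <= b for a, v, b in zip(self.lo, point, self.hi))


def label_with_cuboids(points, cuboids, default="unlabeled"):
    """Give each point the label of the first cuboid that contains it."""
    labels = []
    for point in points:
        label = default
        for cuboid in cuboids:
            if cuboid.contains(point):
                label = cuboid.label
                break
        labels.append(label)
    return labels


# Hypothetical sweep: each entry is ((x, y, z), true class of that point).
sweep = [
    ((1.0, 0.2, 0.5), "car"),
    ((1.5, -0.3, 1.0), "car"),
    ((1.2, 0.0, 0.0), "road"),  # road surface under the car, inside the box
    ((1.8, 0.4, 1.3), "rain"),  # raindrop return inside the box
    ((6.0, 2.0, 2.5), "rain"),  # raindrop in open space, no box around it
]
points = [p for p, _ in sweep]
per_point_labels = [c for _, c in sweep]

car_box = Cuboid(lo=(0.5, -1.0, 0.0), hi=(2.5, 1.0, 1.6), label="car")
cuboid_labels = label_with_cuboids(points, [car_box])

mismatches = sum(a != b for a, b in zip(cuboid_labels, per_point_labels))
print(cuboid_labels)
print(f"{mismatches} of {len(points)} points are wrong or missed with the cuboid alone")
```

In this made-up sweep the box absorbs a road point and a raindrop that happen to fall inside it and misses the raindrop outside it, which is the kind of gap that per-point labels are meant to close.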
Open sourcing AV data is not entirely new. Last year Aptiv and Scale released nuScenes, an extensive data set from an autonomous vehicle sensor suite. Argo AI, Cruise and Waymo were among a number of AV companies that also shared data with researchers. Argo AI released curated data along with high-resolution maps, while Cruise shared a data visualization tool it created, called Webviz, that captures the raw data collected by all the sensors on a robot and converts this binary code into visual elements.
Scale's effort is slightly different. For example, Wang said that the license to use this dataset has no restrictions.
"There is a big need right now, and a continual need, for high-quality labeled data," said Wang. "This is one of the biggest hurdles when building self-driving systems. We want to democratize access to this data, especially at a time when many self-driving companies cannot collect it."
That doesn't mean Scale will suddenly give away all of its data. After all, it is a for-profit company. However, it is already considering collecting fresh data later this year and open sourcing it.