Introduction
For many tech enthusiasts, it is exciting to see how quickly machine learning and computer vision are entering people's daily lives around the world. Many manufacturers are already integrating these technologies into their products. Modern computer vision models and algorithms have become widespread thanks to the rapid improvement of computing devices. These devices keep getting smaller and more powerful, which allows us to run neural networks even on our mobile devices.
Another important innovation in computing is the AI accelerator. Without going into too much detail, AI accelerators are high-performance chips for parallel computation, designed specifically for the efficient processing of AI workloads. Today, we will take a closer look at a device built around one.
What is OpenCV AI Kit
The OpenCV AI Kit (OAK) got its start when a team of creators formed around the idea of using spatial AI and computer vision to augment human perception.
As a result, the OAK was developed, which is a modular, open-source ecosystem composed of MIT-licensed hardware, software, and AI training. The OAK allows users to embed the super-power of spatial AI with accelerated computer vision functions into their products.
What is Spatial AI?
Spatial AI is the ability of a visual AI system to make decisions based on two things:
- Visual Perception: AI that can “see” and “interpret” its surroundings visually. A good example is a camera connected to a processor running an object detection network that can spot a cat, a person, or a car in the scene the camera is looking at.
- Depth Perception: AI that can understand how far away objects are. In computer vision jargon, “depth” simply means “how far.”
The idea of Spatial AI is inspired by human vision: we use our eyes to interpret our surroundings, and we also use our two eyes (i.e. stereo vision) to perceive how far things are from us.
According to its creators, the OpenCV AI Kit is a spatial AI powerhouse capable of simultaneously running advanced neural networks while providing depth from its two stereo cameras and color information from a single 4K camera in the center. OAK cameras are built around the Myriad X AI accelerator, making them compact and powerful.
There are different types of OAK devices: OAK–1, OAK–D, OAK–D–Lite, OAK–D–IoT, etc. The letter D in the camera name stands for Depth. So, we can think of OAK–D as a 4K camera with the ability to run neural networks and calculate the distance to objects.
If you are interested in hardware details, you can find all the features and specifications of the camera on the Luxonis official site. For now, let’s focus on the camera capabilities and what we can achieve by using this device.
Capabilities
The OAK solves two primary tasks: running neural networks directly on the edge device and calculating the distance and position of objects in the frame. This combination gives us a very powerful tool.
OAK cameras are designed with a range of computer vision problems in mind, so they can be used for anything from simple affine transforms to complex detection pipelines. Additionally, depth images can be used for object positioning tasks or SLAM.
Let’s investigate what we can achieve using the OpenCV AI Kit.
Overview
You can take an existing CV model and convert it into a suitable format, or train a model from scratch. Helpful hint: Luxonis provides training notebooks for many purposes. Currently, they offer MobileNetv2-SSD, tiny-YOLOv3, tiny-YOLOv4, YOLOv5, and Deeplabv3+ models for training. Existing projects can also be explored in the Luxonis repo.
In this section, we will review the process of model creation and conversion, pipeline creation, and depth image processing using a car license plate recognition example. The pipeline will detect cars, find and recognize text on license plates, and calculate the distance from the camera to the cars.
So, in a nutshell, we will have a model for car and license plate detection, models for text detection and recognition, and depth estimation.
Training
We will use a public dataset with cars and license plates for training. We propose to use the existing training pipeline from Luxonis to simplify and speed up the development process. In this section, we will train the tiny-YOLOv4 model. The notebook for training can be found on the official Luxonis repo.
In this notebook, the authors use TensorFlow 1.x. According to our investigation, TF1 has to be used instead of TF2 because of issues at the model conversion step. Some TensorFlow 2.x models cannot be successfully converted into OpenVINO format (at least not without deep investigation and modification of the model). We will talk about the conversion steps a little further on.
The first step of training is data preparation. We will train the model using the Darknet framework, which means the data needs to be converted into the Darknet format. Without going into too much detail, we need to download the Darknet repository, create txt files with the data annotations, create the configuration files, and start the training process. A minimal sketch of the annotation step is shown below.
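As an illustration, here is what a Darknet annotation looks like: one txt file per image, with one line per object in the normalized class x_center y_center width height format. The helper below is hypothetical (its name and the pixel-coordinate box format of the source dataset are assumptions), but the output format is the one Darknet expects.

def to_darknet_line(class_id, box, img_w, img_h):
    # box is (x_min, y_min, x_max, y_max) in pixels -- an assumption about the source data
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# One .txt file per image, e.g. car_0001.txt next to car_0001.jpg
with open("car_0001.txt", "w") as f:
    f.write(to_darknet_line(0, (120, 80, 640, 420), 1280, 720) + "\n")  # class 0 = car
    f.write(to_darknet_line(1, (300, 350, 420, 395), 1280, 720) + "\n")  # class 1 = license plate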
After this, the model is ready for the next steps. You can use existing models if you don’t need to train your own model on something specific.
Converting
Converting is one of the most critical steps. When the model is ready, it’s time to convert it so it can run on the OAK–D camera; alternatively, you can select an existing model from the Model Zoo. Keep in mind that not every model can be converted: not all operations are supported by the OpenVINO converter. Here you can find the supported operations.
Since we are working with the notebook provided by Luxonis, let’s continue with the model we obtained. We left off at the training step, so we now have the Darknet model ready. First, we need to convert the Darknet model into a TensorFlow model. We can use this repo for the conversion. Run the following command to convert the model:
python convert_weights_pb.py --class_names PATH_TO_obj.names --data_format NHWC --weights_file PATH_TO_yolov4-tiny_best.weights --tiny
The model is now converted into TensorFlow 1 format, which is required for the upcoming conversion. If you are using a TensorFlow 2 model, the next step may cause errors.
You have to install the OpenVINO toolkit for the next conversion. Installation steps are also provided in the Luxonis notebook. Once you have installed the toolkit and prepared the TF model, you are ready to run the next conversion. Be sure to modify the yolo_v4_tiny.json file according to the number of classes in your model.
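As a rough illustration, the file holds a custom_attributes section whose classes value must match your model; the exact layout of yolo_v4_tiny.json may differ between OpenVINO versions, so treat the structure below as an assumption. A small script like this can patch it:

import json

# Hypothetical sketch: set the number of classes in the OpenVINO YOLO transformation config.
with open("yolo_v4_tiny.json") as f:
    config = json.load(f)

# In the versions we have seen, the config is a list with a single entry containing
# "custom_attributes"; adjust if your file is structured differently.
config[0]["custom_attributes"]["classes"] = 2  # e.g. car and license plate

with open("yolo_v4_tiny.json", "w") as f:
    json.dump(config, f, indent=2)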
When converting the model into IR, you should use the mo.py file from the OpenVINO toolkit:
python deployment_tools/model_optimizer/mo.py \
--input_model frozen_darknet_yolov4_model.pb \
--tensorflow_use_custom_operations_config yolo_v4_tiny.json \
--batch 1 \
--data_type FP16 \
--reverse_input_channels \
--output_dir TinyIR
This command generates .xml and .bin files. Now we must compile the IR model to a .blob for use on the DepthAI modules/platform.
The blobconverter package can be installed via pip. For the final step, we run:
import blobconverter

blob_path = blobconverter.from_openvino(
    xml=xmlfile,  # path to the .xml file generated by mo.py
    bin=binfile,  # path to the .bin file generated by mo.py
    data_type="FP16",
    shaves=5,
)
That’s it! We got a .blob file of the model, and now it is ready to use in the OAK–D camera environment.
Pipeline
The pipeline is the complete workflow on the device side: a collection of nodes and the links between them. All data sources, operations, and connections have to be specified in the pipeline. Next, we need to design our pipeline.
Luxonis provides a visual pipeline editor for creating pipelines. We can create the pipeline using this builder or create it with the depthai library. Let’s visualize the pipeline in the builder, but implement it using the Python library.
Pictured above is our workflow for the current project. The pipeline describes how images from the cameras are processed, how we obtain the spatial image, and how license plates are detected and recognized. We will describe all of the components in detail below.
Firstly, we must import the DepthAI library and create a pipeline object:
import depthai
pipeline = depthai.Pipeline()
We need to define all the nodes and links between them to replicate the pipeline. We define input sources and XLinkOuts to interact with outputs and process images from the camera. XLinkOuts have stream names, so we can access them to get results. Note that we will use two outputs from the main camera: xout_rgb and xout_vid. The first one will contain a resized version of the frame used as the input for the YOLO network, and the second one carries the full frame. A sketch of these node definitions follows below.
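Here is a minimal sketch of those definitions using the depthai Gen2 API; the stream names and the 416x416 preview size are assumptions for this example and should match your model and the rest of your code.

# Color camera: 'preview' gives a resized frame for the network, 'video' the full frame
cam_rgb = pipeline.create(depthai.node.ColorCamera)
cam_rgb.setResolution(depthai.ColorCameraProperties.SensorResolution.THE_4_K)
cam_rgb.setPreviewSize(416, 416)  # tiny-YOLOv4 input size (assumption)
cam_rgb.setInterleaved(False)

xout_rgb = pipeline.create(depthai.node.XLinkOut)
xout_rgb.setStreamName("rgb")
cam_rgb.preview.link(xout_rgb.input)

xout_vid = pipeline.create(depthai.node.XLinkOut)
xout_vid.setStreamName("video")
cam_rgb.video.link(xout_vid.input)

# Mono cameras feeding the stereo depth node, used later for spatial detections
mono_left = pipeline.create(depthai.node.MonoCamera)
mono_left.setBoardSocket(depthai.CameraBoardSocket.LEFT)
mono_right = pipeline.create(depthai.node.MonoCamera)
mono_right.setBoardSocket(depthai.CameraBoardSocket.RIGHT)

stereo = pipeline.create(depthai.node.StereoDepth)
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)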
In the image below, you can see what frames we can get from the camera.
We will also define settings and properties for the cameras and the YOLO model. Note that we are using YoloSpatialDetectionNetwork, which allows us to process detection results and spatial images simultaneously. You should specify the same YOLO parameters that were used during training; a sketch follows below.
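Continuing from the nodes defined above, here is a rough sketch of the spatial detection node. The blob path and thresholds are assumptions, and the anchors/masks shown are the standard tiny-YOLOv4 defaults; replace them with the values from your own training configuration.

yolo = pipeline.create(depthai.node.YoloSpatialDetectionNetwork)
yolo.setBlobPath("yolov4_tiny.blob")   # the .blob compiled earlier (path is an assumption)
yolo.setNumClasses(2)                  # car and license plate
yolo.setCoordinateSize(4)
# Standard tiny-YOLOv4 anchors/masks -- use the values from your training config
yolo.setAnchors([10, 14, 23, 27, 37, 58, 81, 82, 135, 169, 344, 319])
yolo.setAnchorMasks({"side26": [1, 2, 3], "side13": [3, 4, 5]})
yolo.setConfidenceThreshold(0.5)
yolo.setIouThreshold(0.5)
yolo.setDepthLowerThreshold(100)    # ignore depth closer than 10 cm
yolo.setDepthUpperThreshold(10000)  # and farther than 10 m

# Feed the resized color frame and the depth map into the network
cam_rgb.preview.link(yolo.input)
stereo.depth.link(yolo.inputDepth)

xout_det = pipeline.create(depthai.node.XLinkOut)
xout_det.setStreamName("detections")
yolo.out.link(xout_det.input)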
There are two parts of the pipeline graph that are not connected through links. The first part handles plate detection and spatial image processing, and the second part handles license plate image cropping and recognition. To connect these two parts, we will use XLinks and an ImageManip node.
Let’s define the ImageManip and NeuralNetwork nodes for license plate recognition. We will use an existing plate recognition model from the model zoo. The model was trained on Chinese license plates, but we can use it for any type of plate if we ignore the region prediction part. A sketch of these nodes is shown below.
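Here is a minimal sketch of that second part of the pipeline. The stream names, the 94x24 input size, and the blob path are assumptions for this example; check the actual input shape of the recognition model you download.

# Host -> device inputs for the plate frames and for crop configurations
xin_frame = pipeline.create(depthai.node.XLinkIn)
xin_frame.setStreamName("manip_img")
xin_cfg = pipeline.create(depthai.node.XLinkIn)
xin_cfg.setStreamName("manip_cfg")

# ImageManip crops/resizes the plate region to the recognition model input size
manip = pipeline.create(depthai.node.ImageManip)
manip.initialConfig.setResize(94, 24)
xin_frame.out.link(manip.inputImage)
xin_cfg.out.link(manip.inputConfig)

# Plate recognition network fed by the ImageManip output
plate_nn = pipeline.create(depthai.node.NeuralNetwork)
plate_nn.setBlobPath("license_plate_recognition.blob")  # path is an assumption
manip.out.link(plate_nn.input)

xout_plate = pipeline.create(depthai.node.XLinkOut)
xout_plate.setStreamName("plate_rec")
plate_nn.out.link(xout_plate.input)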
Now we have to connect the nodes to each other.
At this stage, we have implemented the whole pipeline. Now we need to connect to the device and run it. After that, we will be able to get all the required data from the camera.
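As an illustration, here is a minimal host-side sketch of connecting to the device and reading the results. The queue names match the stream names assumed above, and the loop body is deliberately simplified.

with depthai.Device(pipeline) as device:
    q_vid = device.getOutputQueue("video", maxSize=4, blocking=False)
    q_det = device.getOutputQueue("detections", maxSize=4, blocking=False)

    while True:
        frame = q_vid.get().getCvFrame()      # full-resolution frame
        detections = q_det.get().detections   # spatial detections from the YOLO node

        for det in detections:
            # Spatial coordinates are reported in millimeters
            distance_m = det.spatialCoordinates.z / 1000.0
            print(det.label, round(det.confidence, 2), f"{distance_m:.2f} m")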
Further steps are straightforward: get the results of the detection network, get the distance to the objects, and send images of the plates to the recognition network. Let’s take a look at the part where we send license plate images to the recognition network. The full Notebook is available on GitHub.
As mentioned earlier, we do not have a direct connection between the two parts of the pipeline; this is because of the scaling and cropping of license plate images. To transform the plate images and pass them between the two parts of the pipeline, we use ImageManipConfig:
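A minimal sketch of that step is shown below, continuing the loop above. The "manip_cfg" and "manip_img" stream names, the 94x24 resize, and the frame_planar helper are assumptions for this example; det is a detection with normalized coordinates.

    q_manip_cfg = device.getInputQueue("manip_cfg")
    q_manip_img = device.getInputQueue("manip_img")
    q_plate = device.getOutputQueue("plate_rec", maxSize=4, blocking=False)

    for det in detections:
        # Crop the plate region (detection coordinates are normalized to 0..1)
        cfg = depthai.ImageManipConfig()
        cfg.setCropRect(det.xmin, det.ymin, det.xmax, det.ymax)
        cfg.setResize(94, 24)  # recognition model input size (assumption)
        q_manip_cfg.send(cfg)

        # Send the frame to be cropped by the ImageManip node on the device
        img = depthai.ImgFrame()
        img.setData(frame_planar)  # frame converted to planar BGR bytes (helper not shown)
        img.setType(depthai.ImgFrame.Type.BGR888p)
        img.setWidth(frame.shape[1])
        img.setHeight(frame.shape[0])
        q_manip_img.send(img)

        rec = q_plate.get()        # recognition network output for this plate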
At this point, we have covered all the crucial parts needed to build the demo project.
Conclusion
You can see how our project works in the image below. For demo purposes, we pointed the camera at a monitor displaying an image of a car, which is why the distance calculation results may look confusing.
In summary, OAK is an excellent toolkit for computer vision on the edge. It allows us to solve many different kinds of problems, and it has plenty to offer: from a large set of out-of-the-box features to a number of comprehensive guides and quick-start projects.
On the other hand, building pipelines from scratch and working with the device may feel unfamiliar to Python developers, but this can be overcome by reviewing the comprehensive documentation from Luxonis.
The hardware is also great: the 4K primary camera, the stereo camera pair, and the Myriad X VPU together make for a very powerful device.
This article was written by Danylo Kosmin, Machine Learning Engineer at Akvelon.