Benchmarking OpenCV on STM32 MCUs

By Anton Bondarev and Alexander Kalmuk

OpenCV can execute several types of image processing applications on MCUs.

Image processing has become a part of our lives. Nobody is surprised by facial recognition or driving lane detection. Today, the most common library for these purposes is OpenCV. Currently, OpenCV is focused primarily on high-performance computing (HPC) platforms/microprocessors. Although higher-end microcontrollers have resources comparable to the Pentium II, running OpenCV on them is still very uncommon.

Some time ago, we showed that OpenCV can run on STM32 (and other microcontrollers of a similar class). Our goal then was only to demonstrate that the library works on such hardware. Performance was very low, but we did not investigate the reasons at the time. In this work, we have corrected the obvious shortcomings of the first tests, which allowed us to achieve acceptable performance. This article presents performance measurements for various OpenCV examples on the STM32F7 platform.

All examples in the article are based on Embox RTOS and can be reproduced by following the instructions in the repository with the examples. We build the examples for the board with the -Os optimization flag, and all of them run with the CPU cache enabled. Data files can be located on an SD card, but we keep the images in the QSPI flash on the demo board to simplify the instructions for reproducing the results.

We will use an STM32F769i-discovery board. There are two types of ROM on the board: 2 MB on-chip flash and 64 MB external QSPI flash. Also, there are two types of RAM: 512+16+4 KB on-chip and 16 MB external SDRAM.

Edge detection

Let’s start with the same example that was used in previous work, namely edge detection. The example uses Canny’s algorithm.

We provide the console output for edge detection so that you can compare the performance with the previous work. For the other examples, we provide only tables with the measurement results.

A sample of the analyzed image:

Output for image 512×269:

root@embox:(null)#edges fruits.png 20
Image: 512×269; Threshold=20
Detection time: 0 s 116 ms
Framebuffer: 800×480 32bpp

Output for image 512×480:

root@embox:(null)#edges fruits.png 20
Image: 512×480; Threshold=20
Detection time: 0 s 254 ms
Framebuffer: 800×480 32bpp

The results:

Image Execution time from internal flash (ms) Execution time from external QSPI flash (ms)
fruits.png 512×269 116 120
fruits.png 512×480 254 260

K-means

This example from OpenCV determines the clusters of points and draws a circle of the corresponding colour around each of them.

OpenCV uses the concept of “compactness” to determine clusters:

compactness: It is the sum of squared distance from each point to their corresponding centers. (Source: OpenCV)

In other words, compactness indicates how tightly the points are concentrated around the center of their cluster: the lower the value, the tighter the clusters.
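
The definition quoted above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the OpenCV implementation; the function name compactness and the sample points, centers and labels are hypothetical.

```python
def compactness(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centers[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


# Hypothetical data: two clusters with two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 14.0)]
centers = [(1.0, 0.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]  # index of the center assigned to each point

result = compactness(points, centers, labels)
print(result)
```

Moving a point further from its center increases the result, which is why a lower compactness value means tighter clusters.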

kmeans.cpp generates a 480×480 image with several clusters of dots of different colours as input. The center of each cluster is chosen at random, and points are added to the cluster according to a normal distribution.

Compactness Execution time from ROM (ms) Execution time from QSPI (ms)
733589 34 98
160406 6 18
331447 14 38
706280 13 36
399182 8 25

Squares

Recognition of geometric shapes, in particular rectangles, is also a standard example in the OpenCV library.

A sample of the analyzed images (taken from pic6.png listed in the table below)

The results for the images at 400×300:

Image Execution time from ROM (ms) Execution time from QSPI (ms)
pic1.png 1312 1668
pic2.png 4893 7268
pic3.png 1263 1571
pic4.png 2351 3590
pic5.png 1235 1515
pic6.png 1575 2202

Face detection

Face recognition was the original goal of our research. We wanted to estimate how well such algorithms work on boards of this class. We use the standard ‘facedetect’ example with a set of five images. The example uses Haar cascade detection.

A sample of the analyzed images:

The results for images at 256×256:

Image Execution time from ROM (ms) Execution time from QSPI (ms)
seq_256x256/img_000.png 3389 3801
seq_256x256/img_001.png 4015 4454
seq_256x256/img_002.png 4016 4464
seq_256x256/img_003.png 3315 3717
seq_256x256/img_004.png 3526 3952

The results for images at 480×480:

Image Execution time from ROM (ms) Execution time from QSPI (ms)
seq_480x480/img_000.png 14406 16149
seq_480x480/img_001.png 14784 16578
seq_480x480/img_002.png 15106 16904
seq_480x480/img_003.png 12695 14352
seq_480x480/img_004.png 14655 16446

People-detect

We decided to try more complex algorithms and chose human detection. We use the ‘peopledetect’ example from OpenCV, which is based on the histogram of oriented gradients (HOG).

A sample of the images:

The results:

Image Execution time from ROM (ms) Execution time from QSPI (ms)
basketball2.png 640×480 40347 52587

QR code

QR codes are a widely used example of pattern recognition. In this example, the QR code is located in the image and outlined with a square frame. We used only detection, without decoding the content.

The sample image we used was taken from the Internet:

The results:

Image Execution time from ROM (ms) Execution time from QSPI (ms)
qrcode_600x442.png n/a 3092

Each example uses different algorithms (function calls), so different code is linked into the final image. When we built this example, the image did not fit into the on-chip flash; therefore, the results are from QSPI only.

Specifics of working on microcontrollers

There are a few notable points we found while working with OpenCV on microcontrollers. First, the code from the internal memory works faster than from the external QSPI flash, even with the cache enabled.
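
To put a number on this difference, the sketch below computes the QSPI-to-internal-flash time ratio for a few rows copied from the tables in this article. The selection of rows and the helper name qspi_slowdown are ours, chosen only for illustration.

```python
# (internal flash ms, QSPI ms) pairs copied from the tables in this article.
measurements = {
    "edges 512x269": (116, 120),
    "edges 512x480": (254, 260),
    "kmeans (compactness 733589)": (34, 98),
    "squares pic2.png": (4893, 7268),
    "facedetect 256x256 img_000.png": (3389, 3801),
    "peopledetect basketball2.png": (40347, 52587),
}


def qspi_slowdown(rom_ms, qspi_ms):
    """How many times slower the run from QSPI is compared to internal flash."""
    return qspi_ms / rom_ms


slowdowns = {name: qspi_slowdown(*times) for name, times in measurements.items()}

for name, ratio in slowdowns.items():
    print(f"{name}: {ratio:.2f}x")
```

For these rows the ratio ranges from roughly 1.02x for edge detection to about 2.9x for k-means, so the penalty for running from QSPI depends strongly on the example.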

Second, performance depends on where the code is located, which in our opinion is also related to the cache. We found that minor code changes, such as inserting code that is never called, can increase or decrease performance by 5 percent or more.

Third, the internal memory is limited in size. We were unable to quickly get the QR code detection example running from internal flash.

Another important feature relates to the ARM Cortex-M cores. We used CPU cores that support SIMD instructions. This technology increases performance by performing the same operation on multiple data elements packed into a single register simultaneously. To estimate whether it affects performance in similar tasks, we carried out measurements on Linux with and without SIMD support and found that in some examples, such as squares, SIMD increases performance by up to 80%. However, the speed-up depends on the algorithm used.
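
The idea of operating on several data elements in one register can be illustrated without any special hardware. The sketch below packs four 8-bit values into one 32-bit word and adds all four lanes with a few word-wide operations, wrapping each lane modulo 256. It is a software emulation for illustration only: the helper names are hypothetical, and we assume the behaviour resembles the byte-wise add instructions of the Cortex-M DSP extension, which the measurements in this article did not use.

```python
def pack4(values):
    """Pack four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    return values[0] | (values[1] << 8) | (values[2] << 16) | (values[3] << 24)


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def add_bytes_packed(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256.

    The low 7 bits of every lane are added first, so no carry can cross
    into the neighbouring lane. The top bit of each lane is then fixed
    up with XOR.
    """
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return (low ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


xs = [10, 200, 30, 255]
ys = [5, 100, 40, 1]

packed_sum = unpack4(add_bytes_packed(pack4(xs), pack4(ys)))
scalar_sum = [(x + y) & 0xFF for x, y in zip(xs, ys)]
print(packed_sum, scalar_sum)
```

One packed addition replaces four scalar additions here, which is where the potential speed-up for pixel-oriented code comes from.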

For our CPU cores, SIMD is supported only in the form of intrinsic functions. In other words, these instructions have to be inserted manually. OpenCV supports this approach, and you can implement SIMD support for a custom architecture. At the moment, however, OpenCV SIMD support is designed to work only with wide data types (128 bits or more), while the Cortex-M7 core has only 32-bit registers. Therefore, this work did not assess the performance improvement from using SIMD on STM32. We hope this will be a direction for future research.

Conclusion

These results indicate that very complex software such as OpenCV can be used on microcontrollers. We ran several examples, and all of them worked successfully. However, the performance is noticeably lower than on host platforms.

Whether OpenCV is usable on microcontrollers depends heavily on the goal. Most of the basic algorithms run fast enough that the delay is barely perceptible. Edge detection completed in a fraction of a second; this may be quite enough for an autonomous robot. Complex algorithms such as QR code processing can also be used, but it is necessary to weigh the pros and cons of the solution. On the one hand, the 3 to 4 seconds required to complete face detection on a 256×256 image might be too long for some applications; on the other hand, for some purposes it may be fast enough.
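
One way to relate these latencies to an application is to convert them into an upper bound on frames per second. The sketch below does this for a few timings from the tables above. It assumes frames are processed back to back with no capture or display overhead, which is a simplification; the helper name is hypothetical.

```python
def frames_per_second(latency_ms):
    """Upper bound on frame rate if each frame takes latency_ms to process."""
    return 1000.0 / latency_ms


# Internal flash timings copied from the tables in this article.
latencies_ms = {
    "edges 512x269": 116,
    "edges 512x480": 254,
    "facedetect 256x256 img_000.png": 3389,
    "peopledetect basketball2.png": 40347,
}

rates = {name: frames_per_second(ms) for name, ms in latencies_ms.items()}

for name, fps in rates.items():
    print(f"{name}: {fps:.2f} frames/s")
```

Under this assumption edge detection reaches several frames per second, while face and people detection stay well below one frame per second.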

Overall, we find that such platforms are not yet powerful enough to recognize complex objects, for example, to identify a person. The delay is very noticeable compared to the recognition of the same image on the host. However, we must also keep in mind that in this research we compared MCUs with a 64-bit Intel Core i7 with 8 cores and a much higher clock frequency, and, of course, the power consumption and the cost of these platforms are completely different. Besides, the comparison did not involve the most powerful microcontroller: the STM32 family includes the H7 series, which offers more than twice the performance.

You can see how it works in the video below:

Reproducing the results

You can reproduce the results obtained in the article. You need two repositories for this: the main Embox repository and the repository with sample images and ready-made configurations for the STM32F769i-discovery board. Follow the instructions in the README file of the samples repository to reproduce the results.

You can also use other boards; however, you will need to prepare an Embox configuration for the particular board. You can also experiment with other images or store the images on the SD card, but this requires changing the Embox configuration as well.

This article was originally published on Embedded.

Anton Bondarev is the founder of Embox RTOS. Anton graduated from Saint Petersburg Electrotechnical University (LETI) in 2003 with a master's degree in electrical engineering and attended postgraduate courses in Saint-Petersburg State University specializing in software engineering. He has over 20 years of experience in embedded and system programming.

Alexander Kalmuk is the cofounder of Embox RTOS. Alexander graduated from Saint-Petersburg State University in 2014 with a master's degree in mathematics and software engineering and attended postgraduate courses in Saint-Petersburg State University specializing in control theory. He has over 10 years of experience in embedded systems programming.
