Boosting embedded AI performance with edge AI processors

Article by: Rehan Hameed

A look at how Kinara's accelerator and NXP processors combine to deliver the edge AI performance that smart camera designs require.

The arrival of artificial intelligence (AI) in embedded computing has led to a proliferation of potential solutions that aim to deliver the high performance needed for neural-network inferencing on streaming video at high rates. Though many benchmarks such as the ImageNet challenge work at comparatively low resolutions and can therefore be handled by many embedded-AI solutions, real-world applications in retail, medicine, security, and industrial control call for the ability to handle video frames and images at resolutions up to 4Kp60 and beyond.

Scalability is vital, but it is not always available on system-on-chip (SoC) platforms that provide a fixed combination of host processor and neural accelerator. Though such all-in-one implementations often provide a means of evaluating the performance of different forms of neural network during prototyping, they lack the granularity and scalability that real-world systems often need. Industrial-grade AI applications benefit instead from a more balanced architecture in which heterogeneous processors (e.g., CPUs, GPUs) and accelerators cooperate in an integrated pipeline. Such a pipeline does not just perform inferencing on raw video frames; it also uses pre- and post-processing to improve overall results, and it handles the format conversion needed to deal with multiple cameras and sensor types.

Typical deployment scenarios lie in smart cameras and edge-AI appliances. For the former, vision processing and support for neural-network inferencing must be integrated into the main camera board. The camera may need to perform tasks such as counting the number of people in a room without counting them twice when subjects move in and out of view. The smart camera must therefore not only recognize people but also re-identify them based on data it has already processed. This calls for a flexible image-processing and inferencing pipeline in which the application can handle basic object recognition as well as sophisticated inferencing-based tasks such as re-identification.
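The counting logic described above can be illustrated with a minimal sketch. It assumes that a re-identification model (not shown) has already turned each detected person into an embedding vector, and that two embeddings belong to the same person when their cosine similarity exceeds a threshold. The class name, the 0.8 threshold, and the three-element vectors are all hypothetical choices made for illustration, not details from Kinara or NXP.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PeopleCounter:
    """Counts unique people by matching embeddings against a gallery."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value; tune for the re-id model in use
        self.gallery = []           # one stored embedding per person seen so far

    def observe(self, embedding):
        """Return True for a new person, False for a re-identification."""
        for known in self.gallery:
            if cosine_similarity(known, embedding) >= self.threshold:
                return False
        self.gallery.append(list(embedding))
        return True

    @property
    def count(self):
        return len(self.gallery)


counter = PeopleCounter()
counter.observe([1.0, 0.0, 0.0])     # first person enters
counter.observe([0.0, 1.0, 0.0])     # second person enters
counter.observe([0.98, 0.05, 0.0])   # first person walks back into view
print(counter.count)
```

A real system would also need to age out old gallery entries and cope with many more embeddings, which is where offloading the similarity-producing model to an accelerator becomes attractive.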

Building smart cameras and edge AI appliances

Typically, in a smart-camera design, a host processor takes sensor inputs and prepares them for high-throughput inferencing by resizing, cropping, and normalizing the frame data. A similar but more highly integrated use-case is the edge-AI appliance. This takes inputs from multiple networked sensors and cameras, which demands the ability to handle multiple simultaneous compressed (or encoded) video streams. In this multi-camera scenario, the processing power must scale to handle the format, color-space, and other conversions required for inferencing, and must also cope with multiple parallel inferences.
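As a rough illustration of the host-side preparation described above, the following NumPy sketch crops, resizes, and normalizes a single 1080p RGB frame. The 224x224 input size and the mean and standard deviation values are common ImageNet-style defaults used here as assumptions; the actual values depend on the model being deployed, and a production design would more likely use a hardware scaler than nearest-neighbor resizing in software.

```python
import numpy as np


def center_crop(frame, size):
    """Crop a square of side `size` from the center of an HxWxC frame."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]


def resize_nearest(frame, out_h, out_w):
    """Nearest-neighbor resize using integer index maps."""
    h, w = frame.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows[:, None], cols]


def normalize(frame, mean, std):
    """Scale 8-bit pixels to [0, 1], then apply per-channel mean and std."""
    scaled = frame.astype(np.float32) / 255.0
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (scaled - mean) / std


def preprocess(frame, input_size=224,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Crop to a centered square, resize to the model input, normalize."""
    side = min(frame.shape[:2])
    cropped = center_crop(frame, side)
    resized = resize_nearest(cropped, input_size, input_size)
    return normalize(resized, mean, std)


# Synthetic 1080p RGB frame standing in for a camera capture.
frame = np.random.default_rng(0).integers(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
tensor = preprocess(frame)
print(tensor.shape, tensor.dtype)
```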

Smart camera application flow. (Image: Kinara)
Edge AI appliance application flow. (Image: Kinara)

Though fixed SoC-based implementations deal with specific use-cases, the need for scalability calls for a platform that can be tuned to the requirements, with built-in support for extension and upgrades as customer needs change. For this reason, it is important to focus on platforms where hardware capabilities can be scaled easily, without the major code changes that come from moving to devices with different architectures. Few can afford the porting overhead this implies.

Many developers have adopted the embedded-processing platforms sold by vendors such as NXP Semiconductors and Qualcomm because of the wide range of performance, features, and price options they offer. For example, the NXP i.MX applications processors cover a wide range of performance demands. In contrast to the fixed SoC platforms, NXP’s processor family benefits from the vendor’s long-term support and supply guarantees that are necessary in many embedded-computing markets. The i.MX 8M, for example, provides a good basis for edge-AI appliance requirements. Its built-in video decoding acceleration makes it possible to support four compressed 1080p streams on a single processor. Coupling this i.MX applications processor with Kinara’s Ara-1 accelerators adds the ability to perform inferencing on multiple streams or handle sophisticated models.

Requirement for running multiple models

Each accelerator can run multiple AI models on each frame with zero switching time and zero load on the host processor, providing the ability to perform complex tasks in real time. In contrast to some inferencing pipelines that rely on batching of multiple frames to maximize throughput, the Ara-1 is optimized for a batch size of 1 and also for maximum responsiveness.
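The responsiveness cost of batching can be shown with simple arithmetic. The sketch below assumes a single camera stream delivering frames at a fixed rate and considers only the time the first frame of a batch spends waiting for the rest of the batch to arrive; inference time itself is ignored. The function name and the batch sizes are illustrative assumptions.

```python
def batch_fill_delay_ms(batch_size, fps):
    """Time the first frame of a batch waits for the remaining frames to arrive."""
    if batch_size < 1 or fps <= 0:
        raise ValueError("batch_size must be >= 1 and fps must be positive")
    return (batch_size - 1) * 1000.0 / fps


for batch_size in (1, 4, 8):
    delay = batch_fill_delay_ms(batch_size, fps=60)
    print(f"batch size {batch_size}: first frame waits {delay:.1f} ms")
```

With a batch size of 1 the wait is zero, whereas a batch of four frames at 60 frames per second already adds 50 ms before inference can start on the oldest frame.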

This means that a smart-camera design need not fall back on the host processor to run a re-identification algorithm while the accelerator is performing inferencing on another frame or portion of one. Both can be offloaded to the Ara-1 to take advantage of its higher speed. Where more performance is required, such as in edge-AI appliances where different applications may require inference tasks to be performed, multiple accelerators can be used in parallel.

A higher degree of extendibility can be enabled by supporting not just chip-down integration on the smart-camera or appliance PCB but also plug-in upgrades. For chip-down integration, Ara-1 supports the industry-standard, high-bandwidth PCIe interface for an easy connection to host processors that incorporate PCIe Gen 3 interfaces. A second integration path is to use modules that can plug directly into an upgradeable main board, taking advantage of the PCIe interface and providing the ability to process inputs from as many as 16 camera feeds. For systems as well as prototypes using off-the-shelf hardware, a further option is to take advantage of built-in support for USB 3.2. A simple cable connection provides the ability to test AI algorithms on a laptop, use hardware evaluation kits to kickstart production, or provide simple upgrades to existing systems.

Software infrastructure for seamless transition

Developers have a choice of approaches they can use to streamline integration of the accelerator with the processor and its associated software stacks. Models can be deployed and managed at runtime using C++ or the increasingly common Python application programming interfaces (APIs) running within a Linux environment on Arm or Windows on x86. Kinara’s runtime API supports commands for loading and unloading models, passing model inputs, and receiving inference data, and it gives full control of inferencing and the hardware devices.

Example development flow. (Image: Kinara)

The GStreamer environment provides another way to access the performance of the accelerators. As a library designed for the construction of compute graphs of media-handling components, GStreamer makes it easy to assemble pipelines of filters and combine them into more complex applications that can react to changing situations in the incoming video and sensor feeds.
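To give a feel for how such a pipeline is put together, the sketch below builds a GStreamer pipeline description as a plain string from standard elements (`videotestsrc`, `videoconvert`, `videoscale`, `fakesink`). The article does not name the GStreamer element that wraps the accelerator, so the inference stage is left out and `fakesink` stands in as a placeholder; the helper name `build_pipeline` and the 640x640 RGB output format are assumptions made for illustration.

```python
def build_pipeline(source="videotestsrc num-buffers=60",
                   width=640, height=640, sink="fakesink"):
    """Return a GStreamer pipeline description that converts and scales video.

    The caps string forces RGB output at the requested size, which is the
    kind of format conversion an inference stage typically expects.
    """
    caps = f"video/x-raw,format=RGB,width={width},height={height}"
    stages = [source, "videoconvert", "videoscale", caps, sink]
    return " ! ".join(stages)


description = build_pipeline()
print(description)
```

On a machine with GStreamer installed, the resulting string can be tried with the `gst-launch-1.0` command-line tool or handed to `Gst.parse_launch` from the Python bindings. Swapping the `source` argument for a camera or network source changes the input without touching the rest of the pipeline.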

For AI inferencing, an SDK such as Kinara’s can take trained models in many different forms, including TensorFlow, PyTorch, ONNX, Caffe2, and MXNet, and offers direct support for hundreds of models such as YOLO, TFPose, and EfficientNet, as well as transformer networks. This provides a complete environment for optimizing performance through quantization, using automatic tuning to ensure model accuracy is preserved, and scheduling execution at runtime. With such a platform it is possible to provide insights into model execution to facilitate performance optimization and parameter tuning. A bit-accurate simulator lets engineers evaluate performance before deploying to silicon.
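For readers unfamiliar with quantization, the following sketch shows one generic form of it: symmetric per-tensor quantization of floating-point weights to 8-bit integers, followed by a check of the error introduced. This is a textbook scheme chosen for illustration and is not a description of how Kinara's SDK quantizes or tunes models; the function names and the random weight matrix are assumptions.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float weights."""
    return q.astype(np.float32) * scale


# Random weights standing in for one layer of a trained model.
weights = np.random.default_rng(0).normal(0.0, 0.5, size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(weights)
error = float(np.max(np.abs(weights - dequantize(q, scale))))
print(f"scale={scale:.6f}, max abs error={error:.6f}")
```

The maximum rounding error is bounded by half the scale, which is why tensors with a few large outliers lose more precision and why tools that tune quantization per layer or per channel can help preserve accuracy.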

In summary, as AI becomes an integral part of a growing range of embedded systems, it is important to be able to integrate inferencing functions into a wide range of platforms to take care of evolving needs. This means being able to deploy a flexible accelerator with an associated SDK that allows customers to combine advanced AI acceleration with pre-existing or new embedded systems.


This article was originally published on Embedded.

Rehan Hameed received his PhD from Stanford University where he worked on energy efficient processor design and polymorphic multiprocessor architectures and co-invented the Kinara architecture. In 2014 he co-founded Kinara and is now the CTO where he focuses on driving the company’s technological vision, including continuous evolution of the software stack and silicon architecture. Previously he has worked on the design of multiple chips and algorithms for audio and vision processing.

