Editor's Note: This article is part of Aspencore's Special Project on Embedded Artificial Intelligence (AI), which offers an in-depth look from a variety of angles at the business and technology of imbuing embedded systems with localized AI.

The explosion of artificial intelligence (AI) applications, from cloud-based big data crunching to edge-based keyword recognition and image analysis, has experts scrambling to develop the best architecture to accelerate the processing of machine learning (ML) algorithms. The extreme range of emerging options underscores the importance of a designer clearly defining the application and its requirements before selecting a hardware platform.

In many ways the dive into AI acceleration resembles the DSP gold rush of the late 1990s and early 2000s. As wired and wireless communications took off, the rush was on to offer the ultimate DSP coprocessor to handle baseband processing. As with DSP coprocessors, the goal with AI accelerators is to find the fastest, most energy-efficient means of performing the computations required.

The mathematics behind neural network processing involves statistics, multivariable calculus, linear algebra, numerical optimization, and probability. While complex, it’s also highly parallelizable. In fact, it’s embarrassingly parallelizable, meaning it’s easily broken down into parallel paths with no branches or dependencies (unlike distributed computing), before the outputs of the paths are reassembled and the output produced.
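As a minimal sketch of what that independence looks like, the Python below hands each output neuron's weighted sum to a separate worker and then reassembles the results in order. The function names and the tiny three-neuron layer are illustrative assumptions rather than anything from a particular framework, and the thread pool only demonstrates the structure of the work, not the speed-up that dedicated hardware would provide.

```python
from concurrent.futures import ThreadPoolExecutor

def neuron_output(weights, inputs):
    """Weighted sum for one neuron; it needs only its own weights and the shared inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def layer_forward(weight_rows, inputs, workers=4):
    """Compute every neuron independently, then reassemble the outputs in row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map() returns results in the order of its inputs,
        # so no extra bookkeeping is needed to reassemble the layer output.
        return list(pool.map(lambda row: neuron_output(row, inputs), weight_rows))

# Hypothetical layer: 3 neurons, 4 inputs each (values chosen for easy checking)
weight_rows = [
    [1, 0, 2, -1],
    [0, 3, -1, 1],
    [2, 2, 0, 0],
]
inputs = [1, 2, 3, 4]

outputs = layer_forward(weight_rows, inputs)
print(outputs)  # [3, 7, 6]
```

No neuron's result depends on another's, which is why the same work maps cleanly onto hundreds or thousands of hardware cores.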

There are various neural network algorithms, with convolutional neural networks (CNNs) being particularly adept at tasks such as object recognition -- filtering to strip out and identify objects of interest in an image. CNNs take in data as multidimensional arrays, called tensors, with each dimension beyond the third nested within a sub-array (Figure 1). Each added dimension is called an “order,” so a fifth-order tensor has five dimensions.
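To make the idea of tensor order concrete, here is a small sketch that represents tensors as nested Python lists and reports their shape and order. The helper names (`zeros`, `tensor_shape`, `tensor_order`) and the image dimensions are assumptions made for illustration; production frameworks use dedicated array types instead of nested lists.

```python
def zeros(*dims):
    """Build a nested list of zeros with the given (positive) dimensions."""
    if len(dims) == 1:
        return [0] * dims[0]
    return [zeros(*dims[1:]) for _ in range(dims[0])]

def tensor_shape(tensor):
    """Return the shape of a regularly nested, non-empty list as a tuple."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]  # descend into the first sub-array
    return tuple(shape)

def tensor_order(tensor):
    """The order of a tensor is simply its number of dimensions."""
    return len(tensor_shape(tensor))

# Hypothetical examples: one 4x4 RGB image, then a batch of two such images
image = zeros(4, 4, 3)      # height, width, color channels -> third order
batch = zeros(2, 4, 4, 3)   # batch, height, width, channels -> fourth order

print(tensor_shape(image), tensor_order(image))
print(tensor_shape(batch), tensor_order(batch))
```

Each extra level of nesting adds one to the order, so a fifth-order tensor is a list nested five levels deep.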

Figure 1: CNNs ingest data as tensors, which are matrices of numbers (data sets) that can be visualized as a three-dimensional cube, but within each array is a sub-array, the number of which defines the depth of the CNN. (Image source: Skymind)

AI is less about math, more about fast iteration

This multi-dimensional layering is important to understanding the nature of the acceleration required for CNNs. The process of convolution mathematically “rolls” two functions together using multiplication, hence the wide use of multiply-accumulate (MAC) math. In object recognition, for instance, one function is the source image, the other function is the filter that is being used to identify a feature, which is then mapped to a feature space. This rolling is done multiple times for each filter that is required to identify a different feature in the image, so the math gets very repetitive, and embarrassingly (or pleasingly) parallelizable.
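The repetitive multiply-accumulate pattern can be sketched in a few lines of plain Python. The example below slides a hypothetical 2x2 vertical-edge filter over a 4x4 image and counts the MAC operations; the image, the filter values, and the function name are all assumptions chosen for illustration. Like most CNN implementations, it applies the filter without flipping it (strictly speaking a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and count the MACs."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = ih - kh + 1, iw - kw + 1
    feature_map = [[0] * out_w for _ in range(out_h)]
    macs = 0
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]  # one multiply-accumulate
                    macs += 1
            feature_map[r][c] = acc
    return feature_map, macs

# Hypothetical image: dark on the left, bright on the right
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds to a dark-to-bright vertical edge
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map, macs = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(macs)         # 36
```

Even this toy case needs 36 MACs for one filter, and every output position can be computed independently of the others. A real network repeats this for many filters over far larger images, which is where the demand for parallel MAC hardware comes from.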

To this end, the designs of some AI accelerators employ multiple independent processing cores, numbering in the hundreds or thousands, placed on a single chip together with tightly coupled memory subsystems to mitigate data access latencies and reduce power consumption. Because graphics processing units (GPUs) were designed for highly parallel processing of image manipulation functions, however, they have also turned out to be good candidates for accelerating the type of neural network processing required for AI. The variety and depth of applications for AI, particularly with respect to voice control, robotics, autonomous vehicles, and big data analytics, have lured GPU vendors to shift emphasis and pursue development of hardware acceleration of AI processing.

The problem in AI hardware acceleration, however, is that there is so much data, and so much variation in the accuracy and response times required, that designers need to be very particular about which architecture they choose. For example, a data center is massively data intensive, and because the focus is on crunching that data as fast as possible, power consumption is not the primary constraint. Energy efficiency still matters, though, as it extends component lifespan and reduces overall facility power and cooling costs, which do add up. Baidu's Kunlun processor, which consumes 100 watts but offers 260 TOPS, would be a good candidate for this application.

At the other extreme, a task such as keyword speech recognition, which then opens up a connection to the cloud to execute further commands using natural language recognition, can already be performed on a battery-powered edge device based on GreenWaves Technologies' GAP8 processor. The GAP8 was designed for the edge and so emphasizes ultra-low power. In the middle, something like a camera in an autonomous vehicle that needs to respond as close to real time as possible to identify road signs, other vehicles, or pedestrians, all while still minimizing power consumption, particularly for electric vehicles, might benefit from a third choice. In such applications, a cloud connection is also important to allow updates of the models and software employed, to continually improve accuracy, response times, and efficiency.

Don’t commit to an ASIC just yet

It is this need for updating, in a technology area that is rapidly evolving with respect to both software and hardware, that makes it inadvisable to bake an AI NN accelerator into an ASIC or even a system-in-package (SiP). This holds despite the associated benefits of lower power, smaller footprint, lower cost (at high volumes), and memory access optimizations. Accelerators, models, and NN algorithms are still too much in flux, and offer so much more flexibility than instruction-driven approaches, that only leading-edge, deep-pocketed vendors like Nvidia can afford to iterate upon a particular approach in hardware.

A good example of the work in progress for accelerators is Nvidia’s addition of 640 Tensor Cores to its Tesla V100 GPU. Each core performs 64 floating-point (FP) fused-multiply-add (FMA) operations per clock cycle, providing 125 TFLOPS for training and inference applications. With the architecture, developers can run deep learning training using a mixed precision of FP16 with FP32 accumulate to get a 3X improvement over Nvidia’s own previous-gen Pascal architecture.
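The 125 TFLOPS figure can be roughly reconstructed from the per-core numbers above. The sketch below assumes a boost clock of about 1.53 GHz, which is not stated in this article, and counts each FMA as two floating-point operations (one multiply and one add). Treat it as a back-of-the-envelope check rather than a vendor specification.

```python
TENSOR_CORES = 640            # from the article
FMA_PER_CORE_PER_CLOCK = 64   # from the article
FLOPS_PER_FMA = 2             # one multiply plus one add
BOOST_CLOCK_HZ = 1.53e9       # ASSUMPTION: approximate V100 boost clock, not given in the article

def peak_tflops(cores, fma_per_clock, clock_hz):
    """Peak throughput in TFLOPS if every core issues its FMAs on every clock cycle."""
    return cores * fma_per_clock * FLOPS_PER_FMA * clock_hz / 1e12

v100_peak = peak_tflops(TENSOR_CORES, FMA_PER_CORE_PER_CLOCK, BOOST_CLOCK_HZ)
print(f"{v100_peak:.1f} TFLOPS")  # 125.3 TFLOPS
```

This is a peak number: sustained throughput depends on keeping every core fed with data, which is why memory bandwidth and data layout matter as much as raw MAC count.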

The mixed-precision approach is important, as it has long been recognized that while high-performance computing (HPC) requires precision computation with 32- to 256-bit FP, this level of precision isn’t required for deep neural networks (DNNs). This is because the back-propagation algorithm that is often used in training them is resilient to errors, so 16-bit, half-precision (FP16) is sufficient for training NNs. In addition, storing FP16 data is more memory efficient than storing FP32 or FP64 data, allowing training and deployment of larger networks. For many networks, even 8-bit integer computations are sufficient, without too much of an impact on accuracy.
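To illustrate why mixed-precision schemes pair FP16 inputs with an FP32 accumulator, the sketch below emulates both precisions with Python's standard `struct` module (format codes `e` for half precision and `f` for single precision). The vectors and the helper names are assumptions chosen to make the effect visible; real hardware performs this rounding in silicon.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_fp16_inputs(a, b, round_accumulator):
    """Dot product with FP16 inputs; the running sum is rounded by round_accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)        # inputs stored as FP16
        acc = round_accumulator(acc + product)   # accumulator precision under test
    return acc

# Hypothetical long dot product whose exact value is 10.0
n = 10_000
a = [0.01] * n
b = [0.1] * n

fp16_accumulate = dot_fp16_inputs(a, b, to_fp16)
fp32_accumulate = dot_fp16_inputs(a, b, to_fp32)

print("FP16 accumulator:", fp16_accumulate)
print("FP32 accumulator:", fp32_accumulate)
```

With an FP16 accumulator the running sum stalls well short of 10, because each small product eventually falls below half the spacing between adjacent FP16 values. With an FP32 accumulator the result stays close to 10 even though the inputs are still stored in 16 bits.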

This ability to use mixed-precision computation becomes even more interesting at the edge, where developers can trade precision for lower power when taking inputs from low-precision, low-dynamic-range sensors, such as temperature sensors, MEMS-based inertial measurement units (IMUs), pressure sensors, and low-res video.
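As one hedged illustration of trading precision for cheaper arithmetic, the sketch below applies symmetric linear quantization to map floating-point weights and inputs onto 8-bit integers, performs the dot product entirely in integers, and rescales the result. The scheme, the function names, and the sample values are assumptions for illustration; deployed toolchains use more elaborate calibration.

```python
def quantize_int8(values):
    """Symmetric linear quantization onto integers in [-127, 127].

    Assumes at least one non-zero value so that the scale is non-zero.
    """
    scale = max(abs(v) for v in values) / 127
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def int8_dot(weights, inputs):
    """Dot product computed on quantized integers, then rescaled to a float."""
    qw, w_scale = quantize_int8(weights)
    qx, x_scale = quantize_int8(inputs)
    integer_acc = sum(w * x for w, x in zip(qw, qx))  # integer-only MACs
    return integer_acc * w_scale * x_scale

# Hypothetical weights and normalized sensor readings
weights = [0.42, -0.17, 0.88, -0.65, 0.05, 0.31]
inputs = [0.9, 0.1, -0.4, 0.7, 0.3, -0.2]

reference = sum(w * x for w, x in zip(weights, inputs))
approximate = int8_dot(weights, inputs)

print("float result:", reference)
print("int8 result: ", approximate)
```

The integer result differs from the floating-point reference only by a small quantization error. When the sensor itself delivers just a few bits of useful resolution, that error is often acceptable in exchange for narrower, lower-power arithmetic.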

AI architecture choices span fog computing from edge to cloud

The idea of scalable processing has been extended to the wider network using the concept of fog computing, which closes the capabilities gap between the edge and the cloud by performing the required processing at the optimum place in the network. For example, instead of doing NN image processing in the cloud, it could be done at a local Internet of Things (IoT) gateway or on-premises server that is much closer to the application. This has three direct benefits: it reduces latencies due to network delays, it is more secure, and it frees up available network bandwidth for data that must be processed in the cloud. At a higher level, it’s also generally more energy efficient.

That said, many designers are working on standalone products with integrated cameras, image pre-processing, and NN AI signal chains that only present an output, such as a recognized sign (autonomous vehicle) or face (home security systems), in a relatively closed-loop operation.

In more extreme cases, this processing may need to be done on a battery- or solar-powered device in remote or hard-to-access locations over long periods of time.

To help minimize power consumption for this level of edge-based AI image processing, GreenWaves Technologies’ GAP8 processor comprises nine RISC-V cores. One core is assigned to hardware and I/O control functions, while the other eight form a cluster around shared data and instruction memory (Figure 2). The structure forms a CNN inference engine accelerator, with additional RISC-V ISA instructions to boost DSP-type operations.

Figure 2: GreenWaves’ GAP8 uses nine RISC-V processors and is optimized for low-power AI processing on intelligent devices at the network edge. (Image source: GreenWaves Technologies)

The GAP8 is designed for use on intelligent devices at the network edge and is capable of 8 GOPS while consuming only a few tens of milliwatts (mW), or 200 MOPS @ 1 mW. It is fully programmable in C/C++ and has a minimum standby current of 70 nA.
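To put the two extremes mentioned in this article side by side, the sketch below converts the quoted figures into operations per second per watt. It uses only the numbers given here (200 MOPS at 1 mW for the GAP8, 260 TOPS at 100 W for Kunlun). These are peak vendor figures, likely measured at different precisions and on different workloads, so the comparison is indicative at best.

```python
def ops_per_watt(ops_per_second, watts):
    """Energy efficiency expressed as operations per second per watt."""
    return ops_per_second / watts

# Figures quoted in the article
gap8_efficiency = ops_per_watt(200e6, 1e-3)      # 200 MOPS at 1 mW
kunlun_efficiency = ops_per_watt(260e12, 100.0)  # 260 TOPS at 100 W

# Ratio of the two power budgets
power_ratio = 100.0 / 1e-3

print(f"GAP8:   {gap8_efficiency / 1e9:.0f} GOPS/W")
print(f"Kunlun: {kunlun_efficiency / 1e12:.1f} TOPS/W")
print(f"Power budget ratio: {power_ratio:,.0f}x")
```

On these numbers the data-center part delivers more operations per watt, but it does so inside a power budget roughly 100,000 times larger. For a battery- or solar-powered device the absolute power draw, not the efficiency ratio, is the deciding constraint.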


Patrick Mannion has a long association with EDN, EE Times, and other publications, and currently leads an independent content engineering firm specializing in technology analysis, editorial, and media services.

Want to dig deeper into embedded AI? Check out these other articles from Aspencore's Embedded AI Special Project.

Artificial Intelligence (AI): Who, What, When, Where, Why? -- These days it seems like everyone is talking about artificial intelligence (AI), but what is it, who is doing it, and why is it important? -- EE Web

Applying machine learning in embedded systems -- Machine learning has evolved to become a practical engineering method if approached with a suitable appreciation of its associated requirements and current limitations -- and it's more accessible than you might think. -- Embedded.com

AI attracts chips to the edge -- Virtually every embedded processor vendor has a product or is working on one for accelerating deep learning tasks on client systems at the network’s edge. -- EE Times

Designer’s Guide: Selecting AI chips for embedded designs -- By asking four key questions developers will be able to zero in on the best AI processor candidates for their specific embedded AI project. -- Electronic Products
