Unlike GPUs, which have established programming models and software toolchains, AI processors suffer from a lack of software support.
While artificial intelligence (AI) advancements are often powered by massive GPUs in data centers, deploying AI algorithms on edge devices requires a new generation of power- and cost-efficient AI chips. In recent years, several vendors have developed innovative AI chip architectures to address this need. However, unlike GPUs, which have well-established programming models and software toolchains, current AI processors often focus on performance benchmarks while their software support lags behind.
An AI processor can't be characterized by metrics such as the number of tera operations per second (TOPS) or ResNet50 inferences it can process per second. AI algorithms are much broader than ResNet50, and the software applications and system configurations in which AI gets deployed are diverse and continuously changing. Supporting these use cases requires a software stack comparable in comprehensiveness and maturity to what GPUs offer.
Neural operators, topologies and model sizes
AI chips typically accelerate convolutions and matrix-multiply operations. However, a modern AI framework like TensorFlow has over 400 inference operators, including recurrent cells, transpose convolutions, deformable convolutions, and 3D operators. When developers try an AI accelerator, they often find that the compiler stack doesn't support all of their networks' operators. Adding that support can be painful and expensive, if it can be added at all.
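One practical way to gauge this gap before committing to a part is to enumerate the operator types a trained model actually uses and compare them against the vendor's supported-operator list. The sketch below does this with TensorFlow's graph API; the model path and the supported-operator set are placeholder assumptions, not any particular vendor's list.

```python
import tensorflow as tf

# Load a trained model and trace it to a concrete graph (the file name and
# input shape are hypothetical). Depending on how a model was exported, some
# ops may sit inside nested functions; this sketch inspects the top-level graph.
model = tf.keras.models.load_model("my_model.keras")
fn = tf.function(model).get_concrete_function(
    tf.TensorSpec([1, 224, 224, 3], tf.float32))

ops_used = {op.type for op in fn.graph.get_operations()}

# Hypothetical supported-operator list published by an accelerator vendor.
SUPPORTED_OPS = {"Conv2D", "DepthwiseConv2dNative", "MatMul", "BiasAdd",
                 "Relu", "MaxPool", "AddV2", "Mean", "Reshape", "Softmax"}

# Ignore graph plumbing ops that every toolchain handles.
unsupported = ops_used - SUPPORTED_OPS - {"Placeholder", "Identity", "Const", "NoOp"}
print(f"{len(ops_used)} distinct op types; unsupported: {sorted(unsupported)}")
```

If the unsupported set is non-empty, each missing operator becomes either a compiler feature request or a model redesign.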
Not only do AI researchers keep concocting new operators; model architectures are evolving as well. As an example, transformer architectures were originally built for natural language processing, and now they're also used for vision processing. Furthermore, unsupervised learning and even on-device training are becoming more common. Ultimately, that means the accelerator's architecture and its compiler stack must be flexible enough to support the entire gamut of today's workloads while also accommodating future ones.
Another requirement is support for varying model topologies. Canonical classification networks like VGG or ResNet follow a simple feed-forward structure in which each layer connects to a single subsequent layer (Figure 1a). Most accelerators excel at these types of architectures. However, real-world applications such as semantic segmentation, object detection, pose estimation, and activity recognition typically use models with more irregular topologies, employing complex spatial connectivity patterns between layers (Figure 1b).
Figure 1 Canonical classification networks follow a simple feed-forward structure (1a, top) while real-world applications typically use models with more irregular topologies (1b, bottom). Source: Deep Vision
Models with recurrence have different connectivity patterns in space and time. So an AI processor's architecture must be defined without rigid assumptions about dataflow and control flow. Moreover, the compiler must seamlessly and automatically map any arbitrary neural topology to the hardware. This is often an area with huge gaps in capability.
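To make the contrast concrete, here is a minimal Keras sketch of both styles: a purely sequential chain in the spirit of Figure 1a, and a small graph with branches and a skip connection in the spirit of Figure 1b. The layer choices are illustrative only; the point is that the second model has nodes with multiple producers and consumers, which the compiler must map automatically.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Figure 1a style: simple feed-forward chain, each layer feeds the next.
feed_forward = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10),
])

# Figure 1b style: irregular topology with branches and a skip connection,
# the kind used by detection and segmentation models.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
branch_a = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
branch_b = layers.Conv2D(32, 5, padding="same", activation="relu")(x)
merged = layers.Concatenate()([branch_a, branch_b])          # multi-input node
skip = layers.Add()([x, layers.Conv2D(32, 1)(merged)])       # skip connection
outputs = layers.Conv2D(1, 1, activation="sigmoid")(skip)    # dense prediction
irregular = tf.keras.Model(inputs, outputs)
```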
Edge AI models also vary in size, ranging from a few hundred thousand to hundreds of millions of parameters, so an accelerator must efficiently support all model sizes without limitation. Such limitations often exist, typically due to the accelerator's on-chip memory size, and they become even more significant when an application requires multiple models, which is typically the case at the edge.
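As a rough illustration of why on-chip memory limits bite, the back-of-the-envelope check below adds up the weight storage for a hypothetical three-model pipeline, assuming 8-bit quantized weights. The parameter counts and the 8 MB on-chip budget are illustrative assumptions, not measurements of any specific chip.

```python
# Rough weight-memory check for a multi-model edge pipeline, assuming
# 8-bit quantized weights (1 byte per parameter).
models = {
    "keyword_spotter": 250_000,      # a few hundred thousand parameters
    "detector":        6_000_000,
    "classifier":      25_000_000,   # ResNet50-class model
}

BYTES_PER_PARAM = 1          # int8 weights
ON_CHIP_BYTES = 8 * 1024**2  # hypothetical 8 MB of on-chip weight memory

total = sum(models.values()) * BYTES_PER_PARAM
print(f"Pipeline weights: {total / 1024**2:.1f} MB "
      f"vs {ON_CHIP_BYTES / 1024**2:.0f} MB on chip")
# If the weights don't fit, the accelerator must stream them from DRAM or
# swap models at runtime, and both paths need first-class compiler support.
```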
AI software usage models
Edge AI applications employ a very different usage model compared to AI in data centers. The latter is typically tasked with processing hundreds or thousands of data streams in parallel, with high throughput as the key metric. An edge AI system processes only one or a few data streams, with the primary concern being completing each inference with the lowest possible latency to enable real-time operation. It's important to understand whether an accelerator's design fits your application's usage model.
As an example, consider Accelerator X shown in Figure 2, which can process 200 inferences per second (IPS). Let's examine how it operates in a real-world application where it's embedded inside a camera processing frames at 30 FPS, so a new frame arrives every 33 milliseconds. Imagine that the AI application has a pipeline of three models that must execute sequentially on every frame. The total inference requirement for this application is 90 IPS (3 inferences per frame × 30 FPS). Based on benchmark results, this accelerator should clearly support more than twice the desired performance. Wrong!
Figure 2 This is how Accelerator X operates in a real-world application. Source: Deep Vision
Investigating the benchmark process for this accelerator reveals important assumptions. First, it assumes that the accelerator is repeatedly doing inference on a single model. Second, it assumes that multiple inferences for this model are in flight through the hardware concurrently. As a result, while each inference takes a relatively long 50 milliseconds, the average throughput achieved is still high at 200 IPS. Note that despite this concurrent execution assumption, the result is still claimed as batch-of-one performance, since inferences are submitted to the accelerator one at a time.
Neither assumption is true for this real-world application. Instead of running a single model, it switches models at every inference, and operates on a single camera stream. Moreover, this accelerator incurs an overhead in switching models, which is typical for throughput-oriented architectures. Under real-world conditions, this accelerator’s effective inference rate is less than 20 IPS. Each frame takes about 200 milliseconds to process, as opposed to the 33-millisecond target, resulting in an order of magnitude less performance than what was expected based on benchmarking.
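The arithmetic behind this example is easy to reproduce. In the sketch below, the 50-millisecond inference latency, three-model pipeline, and 30-FPS camera come from the scenario above, while the per-switch overhead is an assumed value chosen so the total lands near the roughly 200 milliseconds per frame described; the exact overhead varies by architecture.

```python
# Back-of-the-envelope model of Accelerator X from the example above.
FRAME_PERIOD_MS    = 1000 / 30   # ~33 ms between camera frames
INFERENCE_MS       = 50          # latency of one inference on one model
MODELS_PER_FRAME   = 3           # pipeline of three sequential models
SWITCH_OVERHEAD_MS = 17          # assumed cost of switching to a different model

# Benchmark view: one model, many inferences in flight -> 200 IPS.
# Real-world view: inferences run back to back and the model changes every
# time, so latencies and switch costs add up within each frame.
frame_ms = MODELS_PER_FRAME * (INFERENCE_MS + SWITCH_OVERHEAD_MS)
effective_ips = MODELS_PER_FRAME * 1000 / frame_ms
effective_fps = 1000 / frame_ms

print(f"Per-frame latency: {frame_ms:.0f} ms (target {FRAME_PERIOD_MS:.0f} ms)")
print(f"Effective rate: {effective_ips:.0f} IPS, {effective_fps:.1f} FPS")
```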
Expectations from AI chip tools
Optimizing a neural network for edge deployment involves a trade-off between highest possible accuracy and the accelerator’s compute resources. This requires iterating over model size, architecture, and input resolution. A developer might evaluate multiple candidate architectures on the hardware, determine the performance and accuracy, identify where improvements are needed, update the model design, and repeat the cycle.
Since most applications require multiple models, you can't run this process in isolation for each model; you must ensure the entire model pipeline fits the available compute budget. Furthermore, in parallel with this refinement process, additional data is continuously gathered to test the models in more scenarios, often prompting further model design changes.
This iterative design process requires rapid prototyping, where a developer can quickly compile a model. The tools should provide extensive simulation capability to verify the fidelity of inference results in software, without requiring hardware deployment, and should accurately estimate the latency, IPS, and power consumption the models will achieve. Additionally, when a performance or accuracy bottleneck appears, the tools must help the developer quickly locate and mitigate it.
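As a sketch of what this loop might look like in practice, the snippet below iterates over candidate models, compiles and simulates each one, and compares the estimates against a latency budget. The compile_and_simulate call, the report fields, and the file names are entirely hypothetical placeholders for a vendor's actual toolchain API; only the iterate-measure-compare structure is the point.

```python
from dataclasses import dataclass

@dataclass
class SimReport:
    latency_ms: float      # estimated per-inference latency
    power_mw: float        # estimated average power
    top1_accuracy: float   # accuracy from software-simulated inference

def compile_and_simulate(model_path: str) -> SimReport:
    """Stand-in for a vendor compile-and-simulate call; returns dummy zeros."""
    return SimReport(latency_ms=0.0, power_mw=0.0, top1_accuracy=0.0)

# Candidate model variants under evaluation (hypothetical file names).
candidates = ["detector_small.tflite", "detector_medium.tflite", "detector_hires.tflite"]
BUDGET_MS = 33.0  # per-frame latency budget for the whole pipeline

for path in candidates:
    report = compile_and_simulate(path)
    status = "fits" if report.latency_ms <= BUDGET_MS else "over budget"
    print(f"{path}: {report.latency_ms:.1f} ms ({status}), "
          f"{report.power_mw:.0f} mW, top-1 {report.top1_accuracy:.1%}")
```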
An AI accelerator isn't an array of MAC and memory units
Given the diversity of AI applications and toolchain requirements, building hard-wired AI solutions is not a viable approach. An AI accelerator design should provide efficient primitives that the compiler can leverage to create the best execution plan for each neural graph. This requires flexibility at every level of hardware design, including highly efficient but fully programmable compute units, a flexible memory system, and a data routing framework that allows the compiler to create any dataflow through the chip.
Also required is a task management approach that offloads this burden from the software while still enabling the compiler to precisely manage all chip activities at a very fine grain. While this flexibility is essential for supporting ever-changing AI workloads, the developer should get a 'black-box' experience in which the compiler abstracts away all implementation complexities.
These features and considerations directly relate to how Deep Vision has architected its AI accelerator and software solution. In interacting with customers, we've found that these are exactly the right features. We designed our hardware and compiler stack for real-world, edge machine learning use cases, which almost always need the lowest latency and multiple-model support with zero-overhead switching. Our architecture and compiler handle models of any size, complexity, or topology, and support rapid model evaluation and prototyping. A key lesson is that no matter how good the AI accelerator appears, customers always require even better software tools.
This article was originally published on EDN.
Rehan Hameed is chief technology officer (CTO) at Deep Vision Inc.