ROSC for neural network accelerator functional coverage

Article by: James Imber and Tim Atherton

While the RISC (reduced instruction set computer) design philosophy has changed the face of computing forever, this article presents a new take on functional coverage for neural network accelerators: reduced operation set computing, or ROSC.

Looking at how RISC evolved, its emphasis on a small, highly flexible, low-level instruction set allowed deeper pipelining, a shift in complexity to the compiler, and better overall performance. RISC has dominated computer architecture for decades and is fundamental to today’s leading processor architectures.

However, the advantages of RISC evaporate in applications where computation is dominated by a small number of complex operation types. A prime example of this kind of compute problem is convolutional neural network (CNN) inference, where the vast majority of compute and bandwidth requirements are for a small number of layer types: for example, convolution, pooling, and activations. A setting like this calls for a hardware accelerator with dedicated, fixed-function implementations of these common tasks. Anything else results in suboptimal hardware that consumes more power and area to reach the same target number of operations per second.

Enter the concept of ROSC

Building a neural network accelerator (NNA) around highly optimized fixed-function hardware works very well for the vast majority of the compute requirements for the network. However, there will inevitably be a large number of relatively uncommon layer types left over: these typically account for a small fraction of the compute requirements, and may include layers such as softmax, argmax and global reductions among many others. We need a good way to handle these leftover layer types.

The problem with highly optimized, dedicated hardware accelerators is that they are narrow in focus: every module must by necessity be designed to do one task extremely well. This results in narrow specialization, which is usually understood to limit the application of that hardware to the domain for which it was designed.

Reduced operation set computing (ROSC) is Imagination Technologies’ (IMG’s) solution to this problem. It comes from the realization that some hardware accelerators contain a highly disguised, general operation set. ROSC means building new operations, for which no specialized hardware exists on the accelerator, out of one or more of the available fixed-function operations.

How to do this is at first non-obvious and often requires some creativity – the hardware sometimes gets used in highly unorthodox ways! However, as a library of such techniques is built up to implement common operations, it becomes progressively easier to reuse them to build new operations. As a methodology, this can extend the flexibility of the accelerator well beyond its primary applications. It brings many of the advantages of RISC to hardware accelerators, such as operation reuse, generality and a shift in complexity to the compiler, without the need to introduce new hardware.

The more conventional alternatives to ROSC are generally as follows:

  • Perform these operations on another device, such as a CPU, GPU or DSP. This is undesirable because it consumes system bandwidth and valuable compute resources from the rest of the system.
  • Include a general-purpose programmable unit such as a microprocessor within or alongside the design. This adds the missing functionality but increases hardware complexity and the power and area overhead. The computational density (ops per unit area) is generally lower for this kind of hardware compared to fixed-function hardware.
  • Add more dedicated hardware blocks for each missing layer type. Although this allows for highly optimized implementations of new modules, it puts the architecture in the position of perpetually playing catch-up to the state of the art (that is, it is not futureproof). It also leads to problems with hardware bloat and dark silicon.

All the above have disadvantages, such as increasing area and power consumption or consuming system resources such as CPU time and bandwidth. By contrast, ROSC provides an elegant way to reuse our existing fixed-function hardware for common neural network operations to cover a very wide variety of other layer types.

Using ROSC to build complex layers

For example, softmax could be built out of hardware-supported operations as shown below. The target architecture in this case is IMG’s Series4 NNAs.

[Figure 1: softmax built out of hardware-supported operations]
  • A 1×1 convolution with a weight tensor with a single filter composed entirely of ones can be used to implement the cross-channel sum.
  • Division can be implemented as a multiplication of one tensor with the reciprocal of the other. IMG Series4’s LRN (local response normalization) module can be configured to compute a reciprocal.
  • Cross-channel max can be done by transposing the channels onto a spatial axis and doing a series of spatial max-pooling ops. Afterwards, the result is transposed back onto the channel axis.
  • Because the cross-channel max is subtracted first, the exponential only ever sees negative and zero input values, so the activation LUT can be configured with an exponential decay function.
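The four building blocks above can be sketched in NumPy to check that they really do compose into softmax. This is only an illustrative model, not IMG's implementation: the function names, the (C, H, W) tensor layout, the pooling window of 2, the LUT range and resolution, and the use of a plain `1.0 / x` as a stand-in for the LRN module's reciprocal mode are all assumptions made for the example.

```python
import numpy as np

def conv1x1(x, weights):
    """1x1 convolution on a (C_in, H, W) tensor; weights has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, x)

def cross_channel_sum(x):
    # A single 1x1 filter composed entirely of ones sums over the channels.
    ones_filter = np.ones((1, x.shape[0]), dtype=x.dtype)
    return conv1x1(x, ones_filter)  # shape (1, H, W)

def max_pool_h(x, window=2):
    """Spatial max-pooling along the height axis of a (C, H, W) tensor.
    Stride equals the window; odd lengths are padded with -inf (float input assumed)."""
    c, h, w = x.shape
    pad = (-h) % window
    if pad:
        padding = np.full((c, pad, w), -np.inf, dtype=x.dtype)
        x = np.concatenate([x, padding], axis=1)
    return x.reshape(c, (h + pad) // window, window, w).max(axis=2)

def cross_channel_max(x):
    t = x.transpose(1, 0, 2)      # channels moved onto the height axis: (H, C, W)
    while t.shape[1] > 1:         # a series of spatial max-pooling ops
        t = max_pool_h(t)
    return t.transpose(1, 0, 2)   # back onto the channel axis: (1, H, W)

def reciprocal(x):
    # Stand-in for the LRN module configured to compute a reciprocal (assumption).
    return 1.0 / x

# Stand-in for the activation LUT: exponential decay sampled on [-16, 0] (assumed range).
_LUT_INPUTS = np.linspace(-16.0, 0.0, 4097)
_LUT_OUTPUTS = np.exp(_LUT_INPUTS)

def exp_lut(x):
    # Linear interpolation between table entries; inputs below -16 clamp to the first entry.
    return np.interp(x, _LUT_INPUTS, _LUT_OUTPUTS)

def rosc_softmax(x):
    """Softmax over the channel axis of a (C, H, W) tensor, using only the ops above
    plus element-wise subtract and multiply."""
    shifted = x - cross_channel_max(x)      # all values are now <= 0
    e = exp_lut(shifted)
    return e * reciprocal(cross_channel_sum(e))
```

Calling `rosc_softmax` on a random float tensor and comparing against a direct `np.exp`-based softmax over axis 0 should agree to within the LUT's interpolation error.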

Using fixed-function hardware in ways other than the usual mode of operation can reduce utilization. Nevertheless, the benefits of keeping data on-device usually outweigh the drawbacks. For example, suppose that we only achieve 1% utilization of a 4-core IMG Series4 NNA for the above implementation of softmax. This NNA can achieve 40 TOPS at full utilization, so even at 1% utilization this still works out at a very respectable 400 GOPS. The availability of on-chip memory coupled with Imagination’s tensor tiling algorithm means that the intermediate data can be kept local, minimizing bandwidth consumption. Finally, we have avoided the need for a coprocessor to execute this layer, and do not need to consume host CPU time.

ROSC is good for code reuse. For example, once we have produced implementations for division and cross-channel maximum for softmax, we can reuse these in other layers. Consider the instance normalization implementation below, which reuses the division implementation from softmax. The square root operation is also implemented via the LRN module. The global mean reduction uses the same trick as we used for global max reduction in softmax.

[Figure 2: instance normalization built out of hardware-supported operations]
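As a rough illustration of this reuse, instance normalization can be modeled in NumPy with the same kind of building blocks. Everything here is a sketch under assumptions: the function names are hypothetical, the LRN module's reciprocal and square-root modes are replaced by ordinary NumPy arithmetic, the global mean is taken as two average-pooling passes whose windows span the full height and width (real pooling hardware would likely need more stages with smaller windows), and the learned scale and shift are omitted.

```python
import numpy as np

def avg_pool(x, window_h, window_w):
    """Spatial average pooling on a (C, H, W) tensor with stride equal to the window.
    H and W are assumed to be exact multiples of the window sizes."""
    c, h, w = x.shape
    blocks = x.reshape(c, h // window_h, window_h, w // window_w, window_w)
    return blocks.mean(axis=(2, 4))

def global_mean(x):
    # A series of pooling ops: first collapse the height, then the width.
    _, h, w = x.shape
    rows = avg_pool(x, h, 1)       # shape (C, 1, W)
    return avg_pool(rows, 1, w)    # shape (C, 1, 1)

def lrn_reciprocal(x):
    # Stand-in for the LRN module configured as a reciprocal (assumption).
    return 1.0 / x

def lrn_sqrt(x):
    # Stand-in for the LRN module configured as a square root (assumption).
    return np.sqrt(x)

def divide(a, b):
    # The division block reused from softmax: multiply by the reciprocal.
    return a * lrn_reciprocal(b)

def rosc_instance_norm(x, eps=1e-5):
    """Normalize each channel of a (C, H, W) tensor to zero mean and unit variance."""
    centered = x - global_mean(x)
    variance = global_mean(centered * centered)
    return divide(centered, lrn_sqrt(variance + eps))
```

Only `avg_pool` and the square-root stand-in are new here; the division is exactly the multiply-by-reciprocal block that softmax already needed.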

It is easy to see how a library of reusable low-level building blocks can be built up in this way, making it progressively simpler to implement new layer types. This is how we can achieve futureproofing using ROSC. ROSC also fits naturally within existing graph lowering compilers such as Glow and TVM, where we can decompose high-level layers into a computational graph as shown above and replace each part in turn with subgraphs composed of primitive neural network operations.
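The lowering idea can be sketched as a small rewrite table applied recursively until only hardware-supported operations remain. This is a toy model, not the Glow or TVM API: the operation names, the set of hardware operations, and the rules themselves are hypothetical, and a real compiler would rewrite a graph with tensor edges rather than a flat list of op names.

```python
# Hypothetical set of fixed-function operations offered by the accelerator.
HARDWARE_OPS = {
    "conv1x1", "transpose", "max_pool", "avg_pool",
    "lut_activation", "lrn",
    "elementwise_add", "elementwise_sub", "elementwise_mul",
}

# Hypothetical library of ROSC building blocks: each rule replaces one
# operation with a sequence of simpler operations.
LOWERING_RULES = {
    "softmax": ["cross_channel_max", "elementwise_sub", "exp",
                "cross_channel_sum", "divide"],
    "instance_norm": ["global_mean", "elementwise_sub", "elementwise_mul",
                      "global_mean", "elementwise_add", "sqrt", "divide"],
    "cross_channel_max": ["transpose", "max_pool", "transpose"],
    "cross_channel_sum": ["conv1x1"],
    "global_mean": ["avg_pool"],
    "exp": ["lut_activation"],
    "divide": ["reciprocal", "elementwise_mul"],
    "reciprocal": ["lrn"],
    "sqrt": ["lrn"],
}

def lower(op):
    """Recursively expand an operation into hardware-supported primitives."""
    if op in HARDWARE_OPS:
        return [op]
    if op not in LOWERING_RULES:
        raise ValueError(f"no lowering rule for {op!r}")
    primitives = []
    for sub_op in LOWERING_RULES[op]:
        primitives.extend(lower(sub_op))
    return primitives

if __name__ == "__main__":
    for layer in ("softmax", "instance_norm"):
        print(layer, "->", lower(layer))
```

Note that `divide` and `reciprocal` are written once and picked up by both `softmax` and `instance_norm`, which is the reuse the library approach relies on; adding a new layer type means adding one rule, not new hardware.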

In applications dominated by a small set of complex operations, such as neural network inference, computational density is maximized using fixed-function hardware, resulting in a CISC processor with what appears at first sight to be extremely limited functional coverage. However, we have found that we can re-task dedicated NNA hardware (in our case, IMG Series4) to cover a surprisingly wide range of layer types.

Using the hardware in this highly unorthodox way tends to reduce utilization. However, the parts of the workload we apply it to are small and we find that it is very often a price worth paying as it increases performance overall, and reduces both bandwidth and power, particularly since the NNA is so computationally powerful compared to other available devices in a typical SoC.

This article was originally published on Embedded.

James Imber is a member of Imagination Technologies’ AI research team, where he works primarily on neural network accelerators, compilers and low-precision inference targeting embedded systems. With nine years’ experience as a researcher in the semiconductor IP industry, he has accumulated 24 granted patents and has contributed to publications in international computer vision conferences including ECCV and ICPR. He undertook his PhD studies at the University of Surrey’s Centre for Vision, Speech and Signal Processing (CVSSP) on shape-assisted intrinsic image decomposition and holds a BEng from the University of Southampton in electronic engineering.

Tim Atherton is director of research in AI at Imagination Technologies. Prior to joining Imagination, Tim was a prize-winning academic (computer science) at Warwick University specializing in mathematical models of biological vision, high performance computing (HPC) architectures and technology transfer to commercial and government organizations.