AI needs a 100-1000X performance improvement over current digital approaches, and analog compute-in-memory systems provide the only viable path forward.
Since the 1960s, analog compute has seen minimal commercial deployment, relegated to military applications and niche industrial use cases. Digital compute dominated commercial applications for decades, but recent advances in analog computing suggest the tide is turning. With the exponential growth in compute requirements for edge-AI applications, digital systems are struggling to keep up. It’s clear that the traditional method of scaling digital compute – that is, moving to more advanced semiconductor process nodes – is reaching the limits of physics (i.e. Moore’s Law is dead), while escalating manufacturing costs have put leading-edge nodes within reach of only a handful of the wealthiest companies. New approaches are needed for the next generation of AI processing. Analog compute has demonstrated 10X advantages in cost and power over digital systems, and the gap will only continue to widen.
Before diving into the feasibility of analog systems compared to digital ones for the AI era, let’s look at the two key factors for AI hardware: scalability and accessibility. Weight counts for AI algorithms vary significantly; computer vision tasks like image recognition may use 5M to 100M weights, while natural language processing models range from 500M to 100B. These numbers will continue to increase as AI algorithms become even more sophisticated, so it’s critical that AI hardware scales across diverse applications. Accessibility means the hardware can process information instantly. Latency issues constrain the user experience, hamper productivity, and present serious safety risks for certain applications.
Modern digital systems are based on the Von Neumann architecture, a computing concept originally introduced in 1945. The architecture specifies a separate computing logic unit and memory unit for accessing and storing data. In a digital system, this is implemented as CPU or GPU computing logic accessing external memory, typically DRAM. Processing a large AI algorithm exposes a significant weakness in the Von Neumann architecture: weights stored in external DRAM must be fetched for every computation during real-time AI processing on the edge device. This weakness creates three system-level issues. First, accessing external memory adds latency, making memory bandwidth the bottleneck to system performance. Second, accessing external memory consumes significant power, and that power consumption only escalates as system performance requirements increase. Third, the BOM cost grows to accommodate higher-performing CPUs and GPUs, faster and larger DRAM, and an active cooling system to dissipate the heat from all that power consumption.
One particular approach that has shown a lot of promise is analog compute-in-memory (CIM), which pairs analog compute with non-volatile memory (NVM) such as flash memory. Analog CIM systems can leverage the impressive density of flash memory for both data storage and computing. This means that analog CIM processors can run multiple large, complex deep neural networks (DNNs) on-chip, eliminating the need for DRAM chips. The approach removes the digital-logic and external-memory bottleneck of the Von Neumann system for AI processing, along with its associated power consumption and BOM cost.
Let’s take a closer look at the advantages of NVM. NVM offers incredible density and zero-power retention, meaning the weights stored in each cell persist even with no power applied. The analog CIM approach stores weights in NVM cells and performs arithmetic operations directly inside them, combining small electrical currents across a memory bank in a fast, power-efficient manner. Because the computation happens in the memory itself, results are produced instantly. Analog CIM systems don’t expend energy accessing weights in external memory, which slashes energy usage.
In analog CIM systems, the flash transistor acts as a variable resistor that attenuates the signal passing to the output in proportion to the analog value stored in the memory cell. This effect performs the multiplication stage of DNN calculations. For the accumulation stage, the outputs of those multiplications are summed by aggregating the current from a whole column of memory cells. This approach allows analog CIM systems to process an entire input vector in a single step, unlike digital processors, which are forced to iterate at high speeds.
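The multiply-and-sum behavior described above can be sketched in a simple numerical model: each flash cell is treated as a stored conductance (the weight), the input vector as applied voltages, and the shared column wire as the accumulator. This is an illustrative model only, not Mythic's actual circuit.

```python
import numpy as np

# Minimal model of one analog CIM column (illustrative, not an actual circuit).
# Each flash cell stores a weight as a conductance G; the input is applied as
# a voltage V. By Ohm's law each cell contributes a current I = G * V, and
# Kirchhoff's current law sums those currents along the shared column wire.

rng = np.random.default_rng(0)

weights = rng.uniform(0.0, 1.0, size=256)  # stored conductances (weights)
inputs = rng.uniform(0.0, 1.0, size=256)   # applied input voltages

# Per-cell currents: the "multiply" happens inside each memory cell.
cell_currents = weights * inputs

# Column current: the "accumulate" happens on the shared wire, in one step.
column_current = cell_currents.sum()

# A digital MAC unit would instead iterate, one multiply-add at a time.
acc = 0.0
for w, x in zip(weights, inputs):
    acc += w * x

assert np.isclose(column_current, acc)
print(f"analog column output matches digital MAC result: {column_current:.4f}")
```

The point of the sketch is the structural difference: the analog column produces its dot product in a single physical step, while the digital loop must issue one multiply-accumulate per weight.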
Key benefits of analog CIM
Whereas a typical digital edge inferencing implementation that holds its large weight arrays in DRAM might have an energy per multiply-accumulate (MAC) of 10pJ, the analog CIM approach can bring that down to as low as 0.5pJ. When you consider that trillions of MAC computations are necessary for vision-based AI inference processing, the energy savings add up fast. So why do digital systems consume so much energy? Two reasons. First, the arithmetic of multiplication: digital systems must employ a large number of logic gates in parallel to achieve high throughput, and that number grows significantly as data resolution increases. Second, accessing weights stored in external DRAM requires significant energy, which grows as resolution and video frame rate increase.
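Using the per-MAC figures above, a back-of-envelope comparison shows how quickly the savings compound. The 2 trillion MACs/s throughput here is an assumed, illustrative figure for a vision workload; only the 10pJ and 0.5pJ numbers come from the text.

```python
# Back-of-envelope energy comparison using the per-MAC figures quoted above.
# The throughput is an illustrative assumption, not a measured workload.

PJ = 1e-12  # one picojoule, in joules

macs_per_second = 2e12       # assumed vision-inference throughput (2 TMAC/s)
digital_pj_per_mac = 10.0    # typical digital edge system, weights in DRAM
analog_pj_per_mac = 0.5      # analog CIM figure quoted above

digital_watts = macs_per_second * digital_pj_per_mac * PJ
analog_watts = macs_per_second * analog_pj_per_mac * PJ

print(f"digital: {digital_watts:.1f} W, analog CIM: {analog_watts:.1f} W "
      f"({digital_watts / analog_watts:.0f}x reduction)")
# digital: 20.0 W, analog CIM: 1.0 W (20x reduction)
```

At these numbers, the same sustained workload drops from roughly 20 W to 1 W of compute energy, which is the difference between needing active cooling and fitting a passively cooled edge device.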
Additionally, thanks to the high density of NVM flash, a single flash transistor can serve as both the storage medium and the computing device; paired with an adder (accumulator) circuit, an extremely compact system can be realized. This also means you save the cost of external DRAM and its associated components.
Analog CIM systems also have cost advantages because they can be manufactured on mature semiconductor process nodes. An added benefit: whereas bleeding-edge nodes often have limited supply-chain availability, mature process nodes are widely available and cost-effective.
Another benefit is that analog CIM systems provide very low latency. Storing and processing inside the NVM flash cell means instant computing results. There are no latency issues from data propagating through digital logic gates and memory in the processor, or from accessing external DRAM. Instead, massively parallel matrix operations are performed on-chip in real time.
Analog CIM systems are ideal for video analytics applications such as object detection, classification, pose estimation, segmentation, and depth estimation. The high frame and sample rates of these systems require high levels of computational throughput. While digital systems can support the basic requirements of real-time AI processing, they are large and extremely power-hungry. Active cooling methods exist, but they are not feasible for many edge devices, which are usually very compact. Another workaround many digital systems use is offloading deep-learning work to remote cloud servers, since these systems can’t meet the energy and size requirements of edge AI applications. The problem is that pushing inferencing to the cloud is often impractical. High-bandwidth communications aren’t always available (just think about drones), and pushing inferencing to the cloud introduces significant latency, making this option infeasible for real-time applications.
Analog systems have also come a long way in their tolerance for changing environmental conditions. In the past, environmental noise could subtly alter processing results. Significant research and development has gone into analog and digital mitigation circuits, built in standard digital processes, that compensate for environmental noise in real-world applications.
While significantly streamlining MAC processing compared to digital systems, analog CIM systems still need additional digital elements to execute a fully trained neural network; functions such as activations and pooling are best executed in digital logic. For example, Mythic complements its analog CIM core with single-instruction, multiple-data (SIMD) accelerator units, a RISC-V processor that coordinates operations, a network-on-chip (NoC) to route data traffic, and local SRAM to hold data, enabling the AI inference processor to independently execute a complete DNN model. This type of system is very scalable, since it treats each analog CIM core, SIMD engine, and SRAM block as an independent tile in the processor. By linking the tiles within a processor, or multiple processors on a board, the system ensures that input, output, and intermediate data elements flow efficiently.
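The division of labor described above can be sketched abstractly: MAC-heavy matrix work maps onto analog CIM tiles, while activations run in digital SIMD logic between them. The function names below are hypothetical stand-ins for illustration, not Mythic's toolchain or API.

```python
import numpy as np

# Illustrative sketch of a tiled CIM processor. The MAC-heavy matrix-vector
# products run on analog CIM tiles (modeled here as plain matmuls), while
# activations run in digital SIMD logic. All names are hypothetical.

def analog_cim_tile(weights, x):
    """One analog CIM tile: the matrix-vector multiply done in the flash array."""
    return weights @ x

def digital_simd(x):
    """Digital SIMD stage: a ReLU activation, which the analog array doesn't do."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)

# A two-layer model: each layer's weight matrix lives permanently in one tile.
layers = [rng.standard_normal((64, 128)), rng.standard_normal((10, 64))]

x = rng.standard_normal(128)
for w in layers:
    x = digital_simd(analog_cim_tile(w, x))  # analog MACs, then digital activation

print("output vector length:", len(x))  # prints 10
```

Because each tile holds its weights permanently, scaling to a bigger model means adding tiles and routing activations between them over the NoC, rather than streaming weights in from DRAM.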
Key markets for analog CIM
With the incredible performance, power, and cost advantages of analog CIM systems, we’ll see analog CIM integrated into a wide variety of edge-AI applications including video security, industrial machine vision and automation, and autonomous robots and drones.
For the video security market, edge-AI applications can be very useful for protecting people’s safety and assisting with loss prevention. Consider how security cameras might use AI algorithms to detect shoplifting incidents in real time, or how an airport might want to detect when suspicious items are left unattended. Edge-AI applications with analog CIM not only process information instantly, they can also help protect people’s privacy. Unlike legacy systems that need to send entire video streams to a central processing system, analog CIM systems can process information at the edge, so only the metadata of security incidents needs to be sent to the command center. This helps alleviate privacy concerns about surveillance while still protecting the public.
In the industrial sector, there is a growing demand for computer vision applications that can be used for quality control and safety. Analog CIM systems can be used on the assembly line to help identify defects and other production issues in real time. In the future, we’ll also increasingly see AI-powered robots that work side-by-side with humans to transport goods and perform repetitive and strenuous tasks. To ensure the safety of the workers, the robots will need to process information in real time at the edge – a perfect use case for analog CIM systems.
Finally, drones are another key market for analog CIM systems. While there has been a lot of hype around drones over the past few years, mainstream computing approaches can’t meet drones’ unique performance and power requirements. Because digital systems are so power-hungry, they limit drone flight times; it’s also difficult for digital systems to run complex AI networks. With analog CIM solutions used in combination with digital systems, drones can process multiple large, complex DNNs at a fraction of the power of traditional systems.
To realize the full potential of the AI industry, there needs to be a 100-1000X improvement over current digital approaches. Since the pace of improvements in digital systems has slowed, analog CIM systems provide the only path forward to meet the power, performance, cost, and size demands of AI applications. In the future, we’ll see even more advancements in analog, including analog compute enabled in NAND flash and RRAM, in addition to integrating 3D memory technology with advanced chip processes. We look forward to seeing analog compute drive a new era of AI innovation in the coming years.
This article was originally published on Embedded.
David Kuo is the Senior Director of Product Marketing and Business Development at Mythic Inc, a manufacturer of high-performance, low-power AI accelerator solutions for edge AI applications. He is working to bring Mythic’s innovative analog compute technology to the industrial machine vision, automation, and robotics markets. David has over 20 years of product marketing, product management, and business development experience in the Industrial/Consumer IoT, Mobile, and Consumer Electronics markets.