64-bit, Linux-capable processor to speed up compute-on-storage

Article By : Nitin Dahad

Cortex-R82 is Arm's highest performance Cortex-R processor with 64-bit support and Linux capability...

Arm has announced the Cortex-R82, its first 64-bit, Linux-capable processor for real-time compute on storage capability in solid-state drives (SSDs), hard-disk drives (HDDs) and built-in storage solutions as well as computational storage applications.

Real-time embedded systems such as SSDs have historically required less than 4GB of DRAM and addressable space and have not needed to run Linux. With storage capacities continually increasing and performance requirements saturating the throughput of storage host interfaces, the 4GB limit and the inability to run Linux are adding complexity and, in some cases, becoming barriers.

The Cortex-R82 processor, a 64-bit processor capable of addressing up to 1TB of address space, is optimized for such systems, enabling higher-performance real-time compute with more addressable space and the ability to run Linux for the next generation of computational storage devices.

Its Linux support paves the way for simplified computational storage architectures and flexible system on chip (SoC) designs that can reallocate compute resources dynamically based upon changing workloads or different products.

Arm Cortex-R82
Arm Cortex-R82 combines an MPU and an optional MMU in a single core, allowing high-level operating systems like Linux to execute (Image: Arm)

Arm said the Cortex-R82 is the first Arm processor that combines both real-time contexts and memory management unit (MMU) based contexts in a single core. For traditional Cortex-R real-time behavior, a Cortex-R82 core can still be configured with a memory protection unit (MPU) to run bare-metal code and an RTOS. That same core can also be configured with an optional MMU to allow a high-level operating system, like Linux, to execute.

Both the real-time and MMU contexts can be handled by the same core simultaneously, or selected cores in a cluster can be dedicated to real-time or Linux, which increases the flexibility of an SoC design to accommodate multiple products and markets. This choice is handled by software and can even be changed at run time, so the balance can be adjusted to match demand.

Cortex-R82 has three exception levels (ELs). EL2, the highest level, enables a secure enclave and the separation and isolation of virtual machines for OEM code and customer code. More specifically, an MPU context running at EL2 handles context switches between MPU and MMU contexts at EL1, which run OEM and/or OS code, while user code runs at EL0. When a real-time event occurs while Linux is running, the processor can switch to handle the event and then switch back to Linux. This isolation protects the main firmware and lets end customers of Cortex-R82 based devices add custom software, either real-time or Linux based.
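The switching behavior described above can be pictured with a toy model. The sketch below is purely illustrative and is not Arm software: the `Core` class, its method names and the trace strings are hypothetical, and it only mirrors the order of events in the text (Linux running, a real-time event arrives, EL2 switches contexts, the event is handled, Linux resumes).

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model of one core hosting a Linux (MMU) and a real-time (MPU) context.

    Hypothetical names throughout; this is not an Arm API.
    """
    active: str = "linux"  # context currently running at EL1
    trace: list = field(default_factory=list)

    def _switch(self, target: str) -> None:
        # In this model the switch is performed by software running at EL2.
        self.trace.append(f"EL2: switch {self.active} -> {target}")
        self.active = target

    def realtime_event(self, name: str) -> None:
        """Handle a real-time event, then resume whatever was running before."""
        previous = self.active
        if previous != "realtime":
            self._switch("realtime")
        self.trace.append(f"EL1 (MPU context): handle {name}")
        if previous != "realtime":
            self._switch(previous)


core = Core()
core.realtime_event("host read")
for line in core.trace:
    print(line)
```

After the event is handled, `core.active` is `"linux"` again, which is the round trip the paragraph describes.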

The processor’s 40 address bits allow it to directly address up to 1TB, which enables real-time systems with very large memory or device spaces and improves performance over windowing solutions. This large address space can be accessed over either AXI or CHI to enable additional capabilities, including atomics and cache stashing.
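The 4GB and 1TB figures follow directly from the address width. As a quick check, assuming the article's GB and TB are binary units (2^30 and 2^40 bytes) and that each address refers to one byte:

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits


GIB = 1 << 30  # assumed meaning of "GB" in the article
TIB = 1 << 40  # assumed meaning of "TB" in the article

limit_32 = addressable_bytes(32)  # 32-bit addressing
limit_40 = addressable_bytes(40)  # the 40 address bits cited for Cortex-R82

print(limit_32 // GIB, "GiB")              # 4 GiB
print(limit_40 // TIB, "TiB")              # 1 TiB
print(limit_40 // limit_32, "x larger")    # 256 x larger
```

Eight extra address bits give 2^8 = 256 times the directly addressable space of a 32-bit design.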

The Cortex-R82 processor provides a performance uplift over Cortex-R8 on standard benchmarks and even higher uplift on actual partner code. Partner code execution is showing 74-125% performance uplift compared with Cortex-R8. The Cortex-R82 processor also provides a 21% performance uplift over Cortex-A55 when running SPECINT2006 benchmarks. The performance uplift satisfies the most demanding real-time embedded workloads and easily runs full Linux distributions.

Cortex-R82 performance uplift
With Arm Compiler 6.14 at the O3 optimization level, the EEMBC Consumer benchmark improves significantly thanks to the Neon SIMD instructions. Actual customer code benchmarks show a 74% to 125% improvement over Cortex-R8. (Image: Arm)

The Cortex-R82 processor optionally includes the latest Neon instructions to greatly accelerate machine learning (ML) workloads with capabilities such as Dot Product support. This is especially useful for computational storage where the Arm compute library and Arm NN library can be accelerated by Neon, for example to search for a specific image in a drive full of images.
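To see why dot-product support matters for ML, it helps to look at what such an instruction computes. The sketch below models in plain Python the arithmetic of a signed 8-bit dot-product-accumulate operation of the kind found in Arm's Neon dot-product extension: each 32-bit accumulator lane receives the sum of four 8-bit products. The function name `sdot_model` is hypothetical, the model ignores 32-bit wraparound, and it makes no claim about how Cortex-R82 implements the operation.

```python
def sdot_model(acc, a, b):
    """Model a signed 8-bit dot-product-accumulate across 32-bit lanes.

    acc: list of lane accumulators (one per group of four 8-bit elements)
    a, b: lists of signed 8-bit values, four per lane
    Returns a new list of accumulators; overflow is not modelled.
    """
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("a and b must each hold four 8-bit values per lane")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError("inputs must fit in a signed 8-bit integer")

    result = []
    for lane, base in enumerate(acc):
        start = 4 * lane
        products = (x * y for x, y in zip(a[start:start + 4], b[start:start + 4]))
        result.append(base + sum(products))
    return result


# Four lanes, sixteen 8-bit inputs per operand, as in a 128-bit vector.
print(sdot_model([0, 0, 0, 0], list(range(1, 17)), [1] * 16))  # [10, 26, 42, 58]
```

One such operation replaces sixteen multiplies and sixteen adds, which is the inner loop of the quantized convolutions and matrix products that libraries such as Arm NN rely on.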

Single core adjusts to workloads based on demand

The ability to run both real-time software and Linux on the same core or cluster of cores is key in emerging technologies such as computational storage. Real-time capability is required for data transfers through the SSD, just as in traditional SSDs. Running Linux and associated software tools directly on the drive facilitates computational workload management and filesystem recognition, so the drive can perform computation and generate insight locally, greatly reducing data movement, latencies, and energy consumption.

Storage vs computation workload Arm Cortex R82
The same Cortex-R82 core can be used to adjust the types of workload running on a storage controller. Hence the same product can be dynamically configured through software to run SSD functions during the day and switch to computational storage at night. (Image: Arm)

This same capability could be achieved with, for example, a cluster of Cortex-R8 cores for real-time work and a cluster of Cortex-A cores for Linux, but the overall system architecture is simplified with Cortex-R82 since it can handle both. This reduces die size and cost and, most importantly, enables flexibility. The same SoC can be used for an ordinary enterprise SSD and reconfigured for a computational storage drive (CSD) product, saving the large mask-set costs of creating multiple SoCs in smaller processes. The same product can even be dynamically configured through software to run SSD functions during the day and switch to computational storage at night.
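A rough sketch of what such a software-driven rebalancing policy might look like follows. Everything here is an assumption made for illustration: the function `split_cores`, the eight-core cluster, the load percentages and the rule of always keeping at least one real-time core are hypothetical and do not come from Arm.

```python
def split_cores(total_cores, storage_load, compute_load, min_realtime=1):
    """Split a cluster between real-time (SSD) and Linux (computational) roles.

    Loads are relative weights, for example percentages. At least
    `min_realtime` cores always stay on real-time duty so that data
    transfers through the drive keep being served.
    Returns (realtime_cores, linux_cores).
    """
    if total_cores < min_realtime:
        raise ValueError("cluster is smaller than the real-time minimum")
    total_load = storage_load + compute_load
    if total_load == 0:
        return total_cores, 0  # idle: leave every core in the real-time role

    realtime = round(total_cores * storage_load / total_load)
    realtime = max(min_realtime, min(total_cores, realtime))
    return realtime, total_cores - realtime


# Daytime: mostly SSD traffic. Night: mostly on-drive computation.
print(split_cores(8, storage_load=90, compute_load=10))  # (7, 1)
print(split_cores(8, storage_load=20, compute_load=80))  # (2, 6)
```

A real controller would base the decision on measured queue depths and latency targets rather than two fixed weights, but the shape of the policy is the same: one pool of identical cores, divided by software.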

Development tools

Arm has a suite of technologies and tools to support and speed up the development of Cortex-R82 based storage controllers and to reduce its risk. Arm Development Studio and Fast Models enable early hardware and software co-development, and Cycle Models allow custom benchmarking and performance optimization ahead of silicon availability. Training and design review services, along with Cortex-R82 Artisan Physical IP and POP IP, are available to help accelerate time to market and reduce risk. Arm is developing a TSMC 7FF POP to deliver the best PPA required for Cortex-R82 use cases.
