Ride along with the team designing a 16-core processor aimed squarely at high-performance datacoms.
Our company (Freescale at the time, before we became part of NXP) faced a challenge: We had built and were sampling initial silicon of an 8-core 2 GHz processor, the LS2088A, but customers always want more. Our discussions went something like, “How much more? Do we really need more? And, if the market does need more, more of what?” What to do next? We had to pinpoint what was missing in the market and in our portfolio and, most importantly, which emerging applications and use cases we wanted to pursue.
Just throwing more of everything onto the next chip was one option, but a more measured approach won out. This article dives into the thinking behind the LX2160A and how we solved important challenges. It might be a 2 GHz-class processor, and it might have double the number of cores of the LS2088A, but it definitely is not just “more of everything.” We’ll describe how we approached I/O interfaces, based on the applications we wanted the processor SoC to excel in, and how that drove aspects of the SoC architecture such as cache, SRAM and DRAM bandwidth, and more. We will discuss why we partitioned specific workloads between accelerators and general-purpose compute cores and, in doing so, how the LX2160A achieves its performance goals in workloads ranging from wireless transport protocol processing to software-defined storage systems for use in cloud data centers.
Our starting point: LS2088A
The LS2088A device has eight Arm Cortex-A72 cores, each running at 2.0 GHz. In 2016, the Cortex-A72 was Arm’s highest performing core. It also had the bonus of consuming less power than its predecessor, the Cortex-A57. It is rare for a core to improve on the competing targets of performance and power simultaneously.
For each pair of Arm cores, we allocated 1 MB of L2 cache. Complementing this 4 MB of total L2 cache, the LS2088A has 1 MB of L3 platform cache. We chose a relatively small L3 cache in LS2088A so that we could devote as much of the chip’s area as possible to the L2 caches, which are lower latency when accessed by the core and therefore have a larger impact on core performance than the L3. To access main memory, we endowed the LS2088A with two 72-bit DDR4 controllers, each operating at 2.1 GT/s, plus a third 36-bit DDR4 controller intended for use by the Ethernet datapath. In the LS2088A’s target applications of network packet processing, significant bandwidth is needed by data structures (both packet data and routing tables) private to the Ethernet datapath, which led to the presence of the third DDR4 controller.
The LS2088A design had 16 SerDes lanes for external high-speed I/O, which could be configured to support up to four PCIe Gen3 controllers (PCIe Gen3 being the fastest speed available at the time), and 16 Ethernet MACs (eight of which supported up to 10 Gbps). In terms of acceleration, the LS2088A has specialized compression, decompression, pattern matching, and security coprocessors, as well as a programmable AIOP (Advanced IO Processor) for autonomous packet processing. The AIOP was targeted at network routing and forwarding applications where Ethernet packets went through potentially multiple table lookups and header manipulations.
With that architecture, the LS2088A is capable of an aggregate core performance of approximately 100,000 CoreMark, or a SPEC CPU2006-Int score of 81. It is capable of 40 Gbps of DPDK IPv4 simple forwarding at 128-byte packet size. Or, utilizing the AIOP, the device is capable of 19.4 Gbps of complex IPv4 forwarding (complex forwarding being a use case with three exact match lookups, one longest prefix match lookup, and one 5-tuple access control list lookup per packet). Most importantly, the AIOP could achieve this rate fully offloaded from the CPU datapath, with zero loading of the Cortex-A72 cores.
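To put the 40 Gbps figure in perspective, the following is a rough sketch of the packet rate and per-packet cycle budget it implies. It assumes that the 128-byte size is the full Ethernet frame, that 40 Gbps is measured on the wire (so each frame carries 20 extra bytes of preamble, start-of-frame delimiter, and inter-frame gap), and that all eight 2.0 GHz cores share the forwarding work. The function names are illustrative only, and none of these assumptions are stated in the benchmark description above.

```python
# Assumption: standard Ethernet per-frame wire overhead,
# preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12).
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_bps, frame_bytes,
                       overhead_bytes=ETHERNET_OVERHEAD_BYTES):
    """Packet rate needed to fill a link with frames of one size."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


def cycles_per_packet(cores, core_hz, pps):
    """Aggregate core cycles available for each packet."""
    return cores * core_hz / pps


pps = packets_per_second(40e9, 128)          # 40 Gbps, 128-byte frames
budget = cycles_per_packet(8, 2.0e9, pps)    # eight cores at 2.0 GHz
print(f"{pps / 1e6:.1f} Mpps, {budget:.0f} cycles per packet")
# prints "33.8 Mpps, 474 cycles per packet"
```

Under these assumptions, every packet has to be received, looked up, and transmitted within a few hundred core cycles in aggregate, which helps explain the appeal of offloading more complex forwarding to the AIOP.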
For the LX2160A, we were looking to provide support for a range of emerging applications. Whereas we optimized the LS2088A for networking and wireless infrastructure, we wanted our next product to serve well in wireless infrastructure (which is moving from 4G LTE to 5G), network function virtualization (NFV), mobile edge computing, and new types of datacenter offload and storage applications.
For these applications, we knew that core performance would remain important. Regardless of an application’s ultimate use, it is rare to see extra core cycles go unloved. Both NFV and edge computing applications place increasing demands on core performance for higher-level applications, in addition to the device’s data and network packet movement tasks. However, if too much core performance were added, it would not be usable, because all of those cores would be waiting for access to the memory subsystem. This is often referred to as “hitting the memory wall.” We therefore had to first calculate how much DRAM bandwidth our target applications required within our cost constraints. The goal became: provide as much core performance as possible to utilize that bandwidth.
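The memory-wall argument can be illustrated with a deliberately simple model: each core is assumed to need a fixed amount of DRAM bandwidth to stay busy, and once total demand exceeds what the memory subsystem can supply, extra cores only add waiting. The function name and both numbers below (3 GB/s per core, a 48 GB/s DRAM budget) are purely hypothetical values chosen for illustration; they are not LX2160A design figures.

```python
def effective_cores(cores, per_core_gbytes_per_s, dram_gbytes_per_s):
    """Number of cores that can stay busy under a fixed DRAM bandwidth cap.

    Simplified model: every core demands the same bandwidth, and cores
    beyond the cap contribute nothing because they stall on memory.
    """
    demand = cores * per_core_gbytes_per_s
    if demand <= dram_gbytes_per_s:
        return float(cores)
    return dram_gbytes_per_s / per_core_gbytes_per_s


# Hypothetical inputs: 3 GB/s needed per core, 48 GB/s of usable DRAM bandwidth.
for n in (8, 16, 32):
    print(n, "cores ->", effective_cores(n, 3.0, 48.0), "effective")
# prints:
# 8 cores -> 8.0 effective
# 16 cores -> 16.0 effective
# 32 cores -> 16.0 effective
```

In this toy model, doubling from 16 to 32 cores buys nothing, which is the reasoning behind sizing DRAM bandwidth first and core count second.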
We had a few DRAM technologies to choose from: LPDDR4, GDDR, HBM, and DDR4. LPDDR4 provided good bandwidth, but because it is a point-to-point technology with no concept of multiple banks of chip selects sharing a common data bus, and also because it is fundamentally a ×32 technology (data bus width per chip) rather than the ×16, ×8, or ×4 technology of DDR4, the maximum DRAM system capacity achievable with LPDDR4 would be too small for our needs. The various GDDR flavors also provided good bandwidth, but their stumbling block was that they only achieve full bandwidth for long sequential accesses, something that could not be guaranteed in this system, with core-initiated cache-line-sized transactions and potentially small Ethernet packets.
We needed the DRAM technology to perform well when accessed coherently and when accessed from the cores, both of which result in cache-line-sized transactions, so this weighed heavily in our DRAM technology selection. We also needed the DRAM to operate well with small Ethernet packets because, in many applications, the system must respond with sufficient performance to Ethernet packets of any size. 3D-stacked DRAM on the same package substrate as the main SoC (such as HBM) was also considered, but it would have added substantial package cost, and, because the DRAM would need to be embedded within the package (rather than being a system-dependent design parameter), it would also limit DRAM capacity. Also, because our processors go into various applications, predicting the exact size of the memory prior to production was a challenge.
That left DDR4, which we were already experienced with, as the technology of choice. The good news was that, since the launch of the LS2088A, 3.2 GT/s DDR4 had appeared on the horizon. The same number and width of DDR4 interfaces could therefore deliver 50% more throughput than the 2.1 GT/s DDR4 interfaces of the LS2088A. To keep both our customers’ PCB design costs and product costs reasonable, more than two DRAM controllers were not viable. We therefore decided to support two 72-bit DDR4 interfaces at 3.2 GT/s.
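The roughly 50% figure can be checked with a quick peak-bandwidth calculation. This sketch assumes that each 72-bit interface carries 64 data bits plus 8 ECC bits, as is conventional for DDR4 with ECC, and it counts only the two main controllers, leaving out the LS2088A’s third 36-bit controller. The results are theoretical peaks rather than sustained throughput, and the function name is illustrative.

```python
def ddr_peak_gbytes_per_s(controllers, data_bits, gigatransfers_per_s):
    """Theoretical peak DRAM bandwidth in GB/s (ECC bits excluded)."""
    return controllers * (data_bits / 8) * gigatransfers_per_s


# Assumption: a 72-bit interface = 64 data bits + 8 ECC bits.
ls2088a = ddr_peak_gbytes_per_s(2, 64, 2.1)  # two main controllers at 2.1 GT/s
lx2160a = ddr_peak_gbytes_per_s(2, 64, 3.2)  # two controllers at 3.2 GT/s
uplift = lx2160a / ls2088a - 1

print(f"{ls2088a:.1f} GB/s -> {lx2160a:.1f} GB/s (+{uplift:.0%})")
# prints "33.6 GB/s -> 51.2 GB/s (+52%)"
```

The uplift depends only on the ratio of transfer rates (3.2 / 2.1), so it holds regardless of the assumed data width.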
—Ben Eckermann is a Technical Director and Systems Architect for Digital Networking at NXP. Ben is currently leading systems architecture and technical requirements for QorIQ processors built on Arm technology. He has designed and architected low-power products for NXP (and formerly Freescale and Motorola) for over 15 years.