Compute Express Link initiative addresses computing memory challenges

Article By : Jeff Hockert

Compute Express Link lets developers dial in the memory bandwidth that is ideal for their application and take advantage of persistent memory options.

In the world of computing, one of the unexpected things to marvel at is the rapid adoption of artificial intelligence (AI) and cloud computing in data centers. These and other forces are driving heterogeneous computing—the use of CPUs, GPUs, FPGAs, ASIC accelerators, network interface controllers (NICs) and other processing elements, all connected to ever-larger pools of memory.

However, high-performance computing (HPC) needs a better way to efficiently connect these processing elements and to share increasingly expensive memory. Enter the Compute Express Link (CXL) initiative, formed to deal with the challenges brought by heterogeneous computing. It’s aimed at providing cache coherency as well as the ability to add new layers of memory without unnecessary costs.

The CXL consortium has attracted widespread support, with nearly 100 member companies and a 14-company board of directors that includes nearly all major vendors, including AMD, Arm, IBM, Intel, and Xilinx.

Much as USB, PCI, and PCI Express were initiated by Intel, the CXL consortium was jump-started when Intel contributed the first iteration of the technology to an initial working group of nine companies. In September 2019, a board of directors was formed, along with a much larger set of 96 member companies.

Before committing to become a full contributing member, any company can visit the CXL site, get a click-through license free of charge, and download an evaluation copy of the current version of the specification. Contributing members are able to guide the evolution of the effort, and engineers can receive training sessions on the 2.0 specification, now in the development stage.

CXL builds upon the PCIe interconnect standard. CXL 2.0, available as of December 2020, will complement the PCIe 5.0 standard, with its 32-GT/s signaling per lane, as products based on it come into use, which is expected in 2021. The subsequent PCIe 6.0 spec is expected to double that rate. CXL, featuring a suite of three protocols, takes advantage of PCIe's ability to carry alternate protocols.

The first, CXL.io, takes over the standard discovery, configuration, and setup functions that PCIe would otherwise handle. When a CXL card is inserted into a PCIe slot, the link recognizes that CXL is in use, shuts off standard PCIe operation, and brings up the CXL.io protocol in its place. This lets a system use the same set of wires and standard PCIe slots, and mix CXL and PCIe resources as needed, which provides an important means of conserving resources and system cost.

The second and third protocols—CXL.cache and CXL.memory—support the ability to maintain cache coherency, reduce latencies, and use new memory types going forward, among other advantages.

Figure 1 The CXL transaction layer comprises three dynamically multiplexed sub-protocols on a single link. Source: Intel

In many ways, CXL is about driving heterogeneous computing, which is where much of the innovation in computing is coming from. In today’s heterogeneous computing world, memory is attached to the CPU, and other banks of memory are attached to the accelerator devices: GPUs, custom logic, FPGAs, NICs, and the like. These pools of memory reside in two different domains, and the different classes of devices talk to the memory with different mechanisms. Maintaining cache coherency is challenging.

The CPU-attached and accelerator-attached memory pools have only PCI peer-to-peer access between them. CXL, with its memory-centric architecture, brings memory architecture and memory semantics to what was traditionally an I/O bus.

Using PCIe’s alternate protocol

CXL takes advantage of the alternate protocol option within PCIe. When we use CXL, PCIe shuts off, CXL takes over, and gives us memory-class latency as opposed to I/O-class latency.
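The switch-over itself is negotiated in hardware during link training, but system software can still tell a CXL-capable device from a plain PCIe device by walking its extended configuration space and looking for the CXL Designated Vendor-Specific Extended Capability (DVSEC), which carries vendor ID 0x1E98. The C sketch below assumes a memory-mapped view of that extended config space is already available; it illustrates the general approach rather than production enumeration code.

#include <stdint.h>
#include <stdbool.h>

#define PCIE_EXT_CAP_START   0x100   /* extended capabilities begin here      */
#define EXT_CAP_ID_DVSEC     0x0023  /* Designated Vendor-Specific Ext. Cap.  */
#define CXL_DVSEC_VENDOR_ID  0x1E98  /* vendor ID carried by CXL DVSEC blocks */

static uint32_t cfg_read32(const volatile uint8_t *cfg, uint16_t off)
{
    return *(const volatile uint32_t *)(cfg + off);
}

/* Walk the PCIe extended-capability list and report whether the device
 * exposes a CXL DVSEC, i.e., whether it can negotiate the CXL protocols. */
bool device_advertises_cxl(const volatile uint8_t *cfg)
{
    uint16_t off = PCIE_EXT_CAP_START;

    while (off != 0) {
        uint32_t hdr    = cfg_read32(cfg, off);
        uint16_t cap_id = (uint16_t)(hdr & 0xFFFF);

        if (cap_id == 0xFFFF)        /* no capability implemented here */
            break;

        if (cap_id == EXT_CAP_ID_DVSEC) {
            /* DVSEC header 1 (offset +4) holds the DVSEC vendor ID. */
            uint16_t vendor = (uint16_t)(cfg_read32(cfg, off + 4) & 0xFFFF);
            if (vendor == CXL_DVSEC_VENDOR_ID)
                return true;
        }
        off = (uint16_t)((hdr >> 20) & 0xFFC);   /* next-capability pointer */
    }
    return false;
}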

In a data center, CXL operates primarily at the node-level layer of the interconnect architecture for chip-to-chip interconnect. For the rack and row levels, the open systems Gen-Z interconnect can provide memory-semantic access to data and devices via direct-attached, switched or fabric topologies.

CXL and Gen-Z are very complementary, with the former used in the node, and the latter outside the node. From a CXL standpoint, Gen-Z could help us be very fluid, and we see a lot of synergy when both are deployed. In fact, we believe that complementarity is a trend that is going to grow over time.

CXL has a great relationship with Gen-Z, including a formal agreement. The intent is to allow Gen-Z to attach to CXL very efficiently. Any fabric needs a coherent interface to the CPU if engineers want it to work reliably. So, it made sense to have Gen-Z operate more at the rack level and row level, sitting above CXL.

Asymmetric complexity is key

One capability within CXL is the ability to "bias" coherency handling asymmetrically across the system. Operations would most often run under a "CPU bias," since cache coherency is most often enforced at the "home agent" of the CPU. Accelerators, which work with a particular class of data most of the time, would use a somewhat simpler "device bias."

With this asymmetric approach, CXL provides the benefits of cache coherency without getting bogged down in the intricacies of the home agent on the CPU. To summarize, asymmetric complexity is a key feature of CXL, which eases the burden of cache-coherent interface designs.

Figure 2 The asymmetric complexity in CXL eases the burden of cache-coherent interface designs. Source: Intel
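As a concrete illustration of the flow, the sketch below shows a typical offload sequence: stage data under CPU bias, compute under device bias, then flip back so the CPU can read the results. CXL itself does not define a software API for bias control, so every name here (set_bias, run_kernel, struct accel) is a hypothetical placeholder for whatever a real accelerator stack would provide.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical placeholders for illustration only; CXL does not define
 * a software API for bias control.                                      */
enum cxl_bias { CPU_BIAS, DEVICE_BIAS };
struct accel { const char *name; };

static void set_bias(struct accel *dev, void *region, size_t len, enum cxl_bias b)
{
    /* A real stack would program the device's bias tracking here. */
    printf("%s: %zu bytes -> %s bias\n", dev->name, len,
           b == CPU_BIAS ? "CPU" : "device");
    (void)region;
}

static void run_kernel(struct accel *dev, void *region, size_t len)
{
    memset(region, 0xAB, len);   /* stand-in for the accelerator's work */
    (void)dev;
}

int main(void)
{
    static char region[4096];    /* stands in for accelerator-attached memory */
    struct accel dev = { "fpga0" };

    set_bias(&dev, region, sizeof region, CPU_BIAS);     /* host stages input   */
    memset(region, 0, sizeof region);

    set_bias(&dev, region, sizeof region, DEVICE_BIAS);  /* device works locally */
    run_kernel(&dev, region, sizeof region);

    set_bias(&dev, region, sizeof region, CPU_BIAS);     /* host reads results   */
    printf("first result byte: 0x%02x\n", (unsigned char)region[0]);
    return 0;
}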

Reducing complexity in CXL will make it possible for processors from different vendors to readily establish coherent caches, a first for the industry. The concept of splitting up the complexity, taking advantage of complexity in the CPU and not duplicating it in the accelerator, is a key value proposition of CXL.

In CXL, the CPU has cacheable access both "north" and "south": to its own memory and to the accelerator's memory. The accelerator has exactly the same capability. And PCI devices, which previously had access to CPU memory, now gain access to the accelerator memory under CXL as well. The result is symmetric access to both portions of memory; the two pools become part of a coherent memory pool addressable by both the CPU and the accelerator.

Adding memory capacity affordably

Inevitably, data center systems need increasing memory capacity and bandwidth. One solution is to add a class of persistent memory between DRAM and solid-state drive (SSD), in many cases large enough to store an entire database. This separate memory tier—less expensive than DRAM—could be used in a variety of storage innovations.

CXL defines three types of devices. Type 1 covers accelerators that have their own cache memory but no attached memory. Type 2 covers accelerators that do have attached memory. In both cases, cache coherency is guaranteed.

The third type covers controllers for memory buffers and memory-expansion devices. A system could add more DRAM and/or persistent memory while moving it off the DDR interface. Logically, a memory buffer would appear no different than if it were on the main memory bus.
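A compact way to keep the three classes straight is by which of the protocols each one exercises (CXL.io is always present for discovery and configuration). The enum below is simply a reference summary; the example devices are typical cases, not an exhaustive list.

/* The three CXL device classes and the protocols each one uses. */
enum cxl_device_type {
    CXL_TYPE1,  /* caching device, no host-managed device memory:
                   CXL.io + CXL.cache (e.g., a SmartNIC caching host data)        */
    CXL_TYPE2,  /* accelerator with attached memory:
                   CXL.io + CXL.cache + CXL.mem (e.g., GPU or FPGA with DRAM/HBM) */
    CXL_TYPE3,  /* memory buffer or expander:
                   CXL.io + CXL.mem (DRAM or persistent memory behind a controller) */
};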

Enabling disaggregated memory

CXL is a high-speed interconnect, and the consortium has worked toward significant latency reduction in order to enable disaggregated memory. Creating shared memory pools with efficient, low-latency access mechanisms is in line with the consortium’s over-arching goal of heterogeneous resource sharing.

Support for Type 3 devices in CXL provides an opportunity to separate the memory controllers. As data centers deal with a wide range of use cases, Type 3 devices could more easily provide access to persistent media or new memory types yet to be put into production.

Merely adding more direct-attach DRAM is proving to be too expensive. Not only is DRAM cost scaling at a much slower pace, but the extra routing layers on the PCB and additional pins on the controllers are also expensive.

Rather than adding more memory on the board, scaling up the number of CXL links is a much simpler approach that doesn't rely on parallel high-speed buses. Parallel DDR interfaces require 200+ pins, while CXL enables fewer pins per package and lower PCB layer counts. With CXL's serial interfaces, the memory can be placed farther away, in more optimal locations, which improves airflow over the memory devices.
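For a rough sense of scale, the snippet below compares the raw per-direction bandwidth of a x16 link at PCIe 5.0 signaling against a single 64-bit DDR5 channel. The figures ignore protocol overhead and are back-of-the-envelope only, and DDR5-4800 is just an assumed example speed grade.

#include <stdio.h>

/* Back-of-the-envelope comparison; raw signaling rates only, no protocol
 * overhead, so treat the results as approximations.                      */
int main(void)
{
    /* x16 link at PCIe 5.0 signaling: 32 GT/s per lane, 128b/130b encoding. */
    double cxl_x16 = 32e9 * (128.0 / 130.0) * 16 / 8 / 1e9;  /* GB/s, per direction */

    /* One DDR5-4800 channel: 4800 MT/s across a 64-bit (8-byte) bus. */
    double ddr5 = 4.8e9 * 8 / 1e9;                           /* GB/s */

    printf("x16 link, PCIe 5.0 rate : ~%.0f GB/s per direction\n", cxl_x16);
    printf("DDR5-4800 channel       : ~%.1f GB/s\n", ddr5);
    return 0;
}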

Vendors could build CXL memory expander devices with media-specific controllers. A system could support a variety of different memory types, including DDR3, DDR4 or DDR5, as well as persistent memory, low-power DRAM and so on, each having a media-specific controller supporting asymmetric or non-deterministic timing and error handling. A slower memory tier can be completely isolated from the main tier, with minimal interference to direct-attached DRAM dual in-line memory modules (DIMMs).
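On Linux, memory behind a CXL Type 3 expander typically surfaces as an additional, CPU-less NUMA node, which gives applications a simple way to steer large, colder allocations onto the expansion tier while keeping hot data on direct-attached DIMMs. A minimal sketch, assuming the expander memory shows up as NUMA node 1 (check /sys/devices/system/node on the actual system) and that libnuma is installed (link with -lnuma):

#define _GNU_SOURCE
#include <numaif.h>      /* mbind(), MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = (size_t)1 << 30;             /* 1-GiB cold buffer            */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bind the buffer's pages to node 1, assumed here to be the CXL tier. */
    unsigned long nodemask = 1UL << 1;
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");                      /* fall back to default policy  */

    ((char *)buf)[0] = 1;                     /* fault a page in on that node */
    return 0;
}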

Figure 3 Representative CXL usages simplify the programming model and enhance performance. Source: Intel

Using CXL, developers can dial in memory bandwidth that is ideal for their application, use persistent memory options, and mix and match as the application requires. The consortium's goal is to bring together many different industry players to ensure a robust, growing ecosystem. We do need to work through interoperability, and while we have a good track record with PCIe, we also need to work through power, mechanical, and management interfaces to build a robust CXL ecosystem.

This article was originally published on EDN.

Jeff Hockert is a senior marketing manager in the Technology Leadership Marketing Team at Intel.
