CXL is an example of heterogeneous computing. It's about composability and flexibility without the need to over-provision.
The rapid emergence of the Compute Express Link (CXL) specification is an excellent example of heterogeneous computing, but not all heterogeneous computing is necessarily CXL. Rather, it’s about connecting to whatever mix of compute, memory, and storage will best tackle a given workload, without the need to over-provision.
While the new protocol has quickly gained traction as a way to provide more efficient access to resources such as memory, CXL is part of a broader trend in computing as data becomes less centralized and gets pushed to the edge, where it is used in more diverse workloads and on a wider variety of devices.
It may sound flashy, said Ryan Baxter, senior director of Micron Technology’s cloud, computing and networking business unit, but heterogeneous ultimately means the architecture is not monolithic and no longer just standard memory connected to an x86 CPU. “You can take that tool to battle, but it may not be the most efficient anymore.” While it’s possible to do machine learning training using x86 servers, he said, “it’s not an architecture well-suited to be able to tackle that kind of problem.”
Heterogeneous computing takes a more parallel approach by leveraging many cores and different types of accelerators connected to very high-bandwidth memory. Nor is it simply about building a specialized x86 server. For more than a decade, said Baxter, data centers have functioned on the notion that you can spin up a server to tackle any problem at that point in time by provisioning the necessary hardware and resources without having to purpose-build anything. He said CXL is an example of how the existing connectivity within the data center can be leveraged to provide the composability needed at the hardware level to solve tomorrow’s most interesting problems.
Baxter said there will continue to be innovation around x86 architectures. “It’s the standard workhorse for getting more bandwidth, but what’s needed is optionality to take a different tool to the battle.” AI training or video transcoding can benefit from more purpose-built hardware, which will emerge in the cloud first simply because of the sheer amount of workloads that get thrown at data centers. “It’s really an evolution of the workloads and use cases that’s driving the need for this heterogeneity at the hardware level.”
One of the key challenges CXL addresses is that memory has become the bottleneck, and the answer isn’t just faster memory or more of it. It’s about getting the data to the right memory for the use case as easily as possible without over-provisioning. Similarly, heterogeneous computing will see data centers and hyperscalers looking to get the most out of their hardware. Baxter said some of the larger cloud customers are already heading in this direction. “They’re keenly aware of how much they’re utilizing underlying hardware.”
Not only are x86 servers no longer the only answer, but neither is over-provisioning hardware, such as deploying 20% more flash SSDs than required. Quality-of-service (QoS) levels must still be maintained, but with increased utilization and less of the over-provisioning that is so expensive. Baxter said a new interface such as CXL enables access to pools of hardware for workloads that require a little more than what the baseline configuration can provide. “You’re still going to have CPUs. You’re still going to have the memory,” he said. “But there’s some important changes coming on the server side that are going to necessitate the move to new interfaces like CXL.”
Besides different pools of memory and storage, accelerators are going to be a key part of making sure workloads have purpose-built hardware. Pliops developed its Extreme Data Processor (XDP) technology because it recognized that legacy approaches can’t keep up with the exponentially growing need for data storage capacity, as well as the computational requirements for processing it. Company president Steve Fingerhut said the success of GPUs has shown there’s value in building something purpose-built to accelerate certain kinds of workloads.
Adding more standard servers and drives isn’t the answer, he said. An accelerator such as XDP increases performance, reduces costs, and allows for an overall smaller footprint by working in combination with the CPU, said Fingerhut. “We could call it a co-processor, but it is something that complements the platform that everybody is buying today.”
The Pliops solution addresses storage stack inefficiencies that result from adding more and more cores to the same memory bus as processors struggle to keep up with ever more data that’s typically being stored on SSDs, said Fingerhut. It’s a key-value (KV) based storage hardware accelerator that can work with any SSD to boost workload performance and optimize SSD usage, and it’s geared toward workloads within databases and software-defined storage. It’s also heterogeneous in that it can be a single solution for commonly used database applications, including RocksDB, MySQL and MongoDB, and leverages the NVMe KV standard, as well as PCIe. It’s not constrained by specific flash types or server models, he said. “Once you start using XDP, you can use it everywhere that flash is deployed. It works with any SSD and accelerates essentially any flash-based application.”
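To make the key-value storage model concrete, here is a minimal sketch of the idea behind the NVMe KV command set: the application stores variable-length values by key and lets the device handle the indexing, instead of maintaining its own block-level translation layer in host software. The class and method names below are invented for illustration; they are not the Pliops or NVMe KV API.

```python
class KVDevice:
    """Toy in-memory stand-in for a key-value storage device."""

    def __init__(self):
        self._store = {}

    def kv_store(self, key: bytes, value: bytes) -> None:
        # On real KV hardware this is a single store command; the host
        # issues no block mapping, journaling, or compaction work.
        self._store[key] = value

    def kv_retrieve(self, key: bytes) -> bytes:
        return self._store[key]

    def kv_delete(self, key: bytes) -> None:
        del self._store[key]


dev = KVDevice()
dev.kv_store(b"user:42", b'{"name": "example"}')
print(dev.kv_retrieve(b"user:42"))  # b'{"name": "example"}'
```

The point of the model is that a key lookup maps directly onto the device, which is why a KV accelerator can sit underneath stores like RocksDB that already speak put/get semantics.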
Pliops is using existing interfaces and protocols to make XDP easy to integrate with maximum flexibility, much like CXL is using PCIe to pull from a broad pool of resources. “We’re taking full advantage of the NVMe performance latency scaling as well as NVMe over Fabrics or TCP.” Fingerhut said XDP becomes a third processor in the system that frees analytics from inefficient host software, because less data needs to be transferred and operations are as efficient as possible.
Pliops isn’t alone in developing purpose-built accelerators that can be part of a resource pool in a more heterogeneous computing environment. Fungible Inc.’s Storage Initiator (SI) cards allow standard servers to access NVMe over TCP (NVMe/TCP) storage targets, while the Fungible Data Processing Unit (DPU) is a processor purpose-built for data-centric workloads that unlocks capacity previously stranded in siloed servers. Last year, Micron unveiled its heterogeneous-memory storage engine (HSE), aimed at getting more from SSDs and other storage-class memory (SCM) by enabling developers using all-flash infrastructure to customize or enhance code for their unique use cases. Similarly, Kioxia America’s Software-Enabled Flash (SEF) combines software flexibility, host control, and flash-native semantics into a flash-native API and purpose-built controller to make flash easier to manage and deploy across a PCIe connection.
The ubiquity of PCIe and rapid adoption of CXL, already in its second iteration, are key enablers of heterogeneous computing. “[CXL] allows things like accelerators to talk to hosts in more of a peer-to-peer fashion,” said Rambus fellow Steve Woo. CPUs and GPUs sometimes need to communicate back and forth through memory, and CXL provides the necessary coherence, which makes programming models much easier in heterogeneous environments.
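As a rough illustration of Woo’s point, the sketch below contrasts the two programming models: without coherence, the host and accelerator each work on private copies that must be explicitly synchronized; with a coherent link such as CXL, both simply load and store the same buffer. ToyAccelerator is a stand-in invented for this example and models no real device API.

```python
class ToyAccelerator:
    """Pretend device that transforms a buffer in place."""

    def compute(self, buf: bytearray) -> None:
        for i in range(len(buf)):
            buf[i] = (buf[i] + 1) % 256  # stand-in for real work


# Non-coherent model: explicit host -> device and device -> host copies.
def offload_with_copies(host_buf: bytearray, dev: ToyAccelerator) -> None:
    device_buf = bytearray(host_buf)  # stage a copy in "device memory"
    dev.compute(device_buf)
    host_buf[:] = device_buf          # copy the result back to the host


# Coherent model (the CXL case): one shared buffer, no staging copies.
def offload_coherent(shared_buf: bytearray, dev: ToyAccelerator) -> None:
    dev.compute(shared_buf)           # both sides see updates directly


buf = bytearray(b"\x00\x01\x02")
offload_with_copies(buf, ToyAccelerator())
offload_coherent(buf, ToyAccelerator())
print(list(buf))  # [2, 3, 4]
```

Removing the staging copies is what makes the coherent model easier to program against, which is the simplification Woo describes.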
With the transition to DDR5 next year, there’s more bandwidth available, but also a greater diversity of workloads, which in turn means a greater diversity of compute environments available on platforms such as Amazon Web Services and Microsoft Azure, said Woo. “They’re so diverse in terms of their compute capabilities, the amount of memory and the amount of disk you’re allowed to have.” It’s now possible to gang together CPUs to solve problems that no longer fit on individual servers, he said. “You need to find ways to expand things like the memory bandwidth and the memory capacity to meet these needs.”
Woo said it will be advantageous for data centers to scale out differently, with disaggregation of resources into separate pools of memory, storage, and accelerators. One workload may be memory-intensive and require only one CPU resource, for example, which is why there’s a lot of talk about grouping types of resources together in their own pools. “You just grab what you need and compose your resources based on the workload characteristics,” he said. “When you’re done, you put it all back in the pool. It’s kind of like a library, and you’re just kind of checking out the resources you need, and you just put them back when you’re done.”
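Woo’s library analogy maps neatly onto a check-out/check-in pattern. The sketch below is purely conceptual; the pool sizes and resource names are made up for illustration and correspond to no real orchestration API.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """Disaggregated pool of compute, memory, and accelerator resources."""

    available: dict = field(default_factory=lambda: {
        "cpu_cores": 256, "memory_gb": 4096, "accelerators": 16})

    def check_out(self, **request):
        # Compose a workload's resources from the shared pool.
        for name, amount in request.items():
            if self.available.get(name, 0) < amount:
                raise RuntimeError(f"pool exhausted: {name}")
        for name, amount in request.items():
            self.available[name] -= amount
        return dict(request)

    def check_in(self, lease):
        # Return everything to the pool when the workload finishes.
        for name, amount in lease.items():
            self.available[name] += amount


pool = ResourcePool()
lease = pool.check_out(cpu_cores=8, memory_gb=512)  # memory-heavy workload
# ... run the workload ...
pool.check_in(lease)                                # put it all back
```

Because leases are sized per workload rather than per server, utilization rises without the standing over-provisioning the article describes.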
Woo said everyone would prefer to do more with the resources they have. “By not over-provisioning things and allowing them to be used on the jobs as they’re needed, it allows your data center to do more work in that volume of space.”
Jeff Janukowicz, IDC research vice president for solid state storage and enabling technologies, said there’s an overall common need to optimize the many compute, storage, and memory resources as much as possible, whether through CXL or heterogeneous computing more broadly, by taking advantage of standardized interfaces. He said it took a while for NVMe to make inroads in the market because software needed to be written and an ecosystem built up around it; by leveraging that existing ecosystem, an emerging protocol such as CXL can be adopted more easily and quickly.
It also allows for maximizing the resources in the system and means less over-provisioning. “Optimization is clearly a key factor,” he said. Resources such as storage are often over-provisioned to account for peak workloads, but a more flexible architecture can optimize those resources in both a peak environment and a more mainstream one. “It’s really going to help you to optimize your costs across the stack.”
Micron’s Baxter said part of the optimization heterogeneous computing provides is through interoperability. “The degree to which the protocols and the interfaces are standard across multiple implementations is important.” But memory is also becoming an even more critical part of any heterogeneous system. “It’s the element that drives a certain degree of performance and efficiency with which you tackle a workload,” he said. “Customers are really wanting to understand how they can use that memory in a more creative way. Heterogeneous means there’s more tools in the toolbox with which you can go tackle the workload.”
This article was originally published on EE Times.
Gary Hilson is a general contributing editor with a focus on memory and flash technologies for EE Times.