By Larry Cleland, Director Sales & Marketing
On April 2, 2020, the Compute Express Link™ (CXL) Consortium and the Gen-Z Consortium announced that they had signed a Memorandum of Understanding (MOU). The announcement indicates that the two Consortiums intend to work together in joint working groups to define how CXL and Gen-Z will interoperate to solve some of the biggest issues facing the cloud and IT world. It didn’t take long to get to work: within a few days of the MOU announcement, joint working groups were established and meetings were underway. Several industry publications have already declared the MOU significant.
So, why is this MOU significant? There has been a great deal of confusion in the industry over the past several years, with multiple high-speed “next generation” bus solutions emerging: CCIX, OpenCAPI, Gen-Z and, most recently, CXL. With several capable options on the table, it was not clear which one would prevail. The emergence of CXL, and the broad industry support it has generated over the past six months, has consolidated the industry around a single processor connection solution. CXL, while a very good solution for the cache-coherent connection to the CPU, is not optimized for the full solution many end users need in the datacenter. CXL sits on top of PCIe Gen-5, so it inherits the PCIe limitations that constrain its reach, switchability and performance.
The CXL and Gen-Z MOU has helped to clarify the picture for node-to-node connectivity and brings the jointly supported fabric solution into focus. Gen-Z is complementary to CXL and will provide the memory-semantic switched fabric for the node-to-node interconnect, with the ability to support the disaggregation of large memory pools and a fabric connection point for GPUs and other devices. Some large cloud providers have indicated that about 50% of their server spend is for memory. Today, memory access is not flexible: it is hard-allocated to a particular server based on local DIMM provisioning. Fabric-attached memory can expose large, multi-server-accessible memory pools, allowing the datacenter to supply memory, as a slightly higher-latency tier, to the server job scheduler when, where and at the capacity required. The disaggregated solution made available with the Gen-Z fabric provides several additional advantages over today’s local memory architecture (a simple sketch of the allocation model follows this list):
1) the reach is expanded to the rack, the row and, in the future, the datacenter;
2) the provisioned memory is no longer tied to one memory type installed in the server’s DIMM slots; various memory types (DDR4, DDR5, various persistent memories, etc.) can be made available as fabric-attached devices;
3) the memory can be made available to all compute nodes on the fabric and can be dynamically allocated as needed;
4) Gen-Z Memory Modules (ZMMs) can be hot-plugged, so the amount of memory in a particular pool can be increased or decreased without taking servers offline, giving the datacenter more flexibility to adjust to compute loading and scheduling requirements;
5) the memory pool, having been removed from the server box, is now CPU agnostic and can be accessed by any processor capable of joining the fabric via a CXL-to-Gen-Z adapter.
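To make the pooling model concrete, here is a minimal, hypothetical sketch in Python. It is not a Gen-Z or CXL API; the names `FabricMemoryPool`, `MemoryModule` and their methods are invented for illustration. It only models the bookkeeping: a CPU-agnostic pool of hot-added modules of mixed media types, from which any server on the fabric can be granted, and can later release, capacity on demand.

```python
# Conceptual sketch only (not a real Gen-Z or CXL interface): models how a
# fabric-attached memory pool could be allocated to servers on demand, in
# contrast to memory hard-wired to a server's local DIMM slots.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class MemoryModule:
    """A hypothetical fabric-attached memory module (e.g. a ZMM)."""
    module_id: str
    media: str          # e.g. "DDR4", "DDR5", "persistent"
    capacity_gb: int
    allocated_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.allocated_gb


class FabricMemoryPool:
    """Toy model of a CPU-agnostic, rack-scale memory pool.

    Any server on the fabric can request capacity; modules can be
    hot-added without taking servers offline.
    """

    def __init__(self) -> None:
        self.modules: dict[str, MemoryModule] = {}
        # server name -> list of (module_id, GB) grants
        self.allocations: dict[str, list[tuple[str, int]]] = {}

    def hot_add(self, module: MemoryModule) -> None:
        """Bring a new module into the pool while the fabric stays live."""
        self.modules[module.module_id] = module

    def allocate(self, server: str, size_gb: int, media: str | None = None) -> None:
        """Grant `size_gb` to `server`, spanning modules if needed."""
        remaining = size_gb
        grants: list[tuple[str, int]] = []
        for m in self.modules.values():
            if media and m.media != media:
                continue
            take = min(m.free_gb, remaining)
            if take:
                m.allocated_gb += take
                grants.append((m.module_id, take))
                remaining -= take
            if remaining == 0:
                break
        if remaining:
            # Roll back partial grants if the pool cannot satisfy the request.
            for module_id, take in grants:
                self.modules[module_id].allocated_gb -= take
            raise MemoryError(f"pool cannot satisfy {size_gb} GB for {server}")
        self.allocations.setdefault(server, []).extend(grants)

    def release(self, server: str) -> None:
        """Return a server's capacity to the pool, e.g. when its job finishes."""
        for module_id, take in self.allocations.pop(server, []):
            self.modules[module_id].allocated_gb -= take


if __name__ == "__main__":
    pool = FabricMemoryPool()
    pool.hot_add(MemoryModule("zmm-0", "DDR5", 512))
    pool.hot_add(MemoryModule("zmm-1", "persistent", 1024))

    pool.allocate("server-a", 256, media="DDR5")   # job needs fast DRAM
    pool.allocate("server-b", 768)                 # large job spans module types
    pool.release("server-a")                       # capacity returns to the pool
```

The point of the toy is the contrast with local DIMMs: capacity is granted and returned at job-scheduling time rather than being fixed at server-build time, and the pool neither knows nor cares which CPU architecture sits behind each server on the fabric.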
In summary, yes, the MOU between CXL and Gen-Z should be seen as significant for the industry in several key areas:
1) it provides a direction for the industry, from the processor connection out to a datacenter-wide fabric, with significantly higher bandwidth and lower latency than what can be achieved today;
2) while implementation remains to be accomplished, the top four CPU providers, Intel, AMD, ARM and IBM, have all endorsed CXL as a low-latency, common processor connection point;
3) it provides CXL with a path to expand to a fabric, giving processor customers that incorporate CXL expanded capability and performance solutions they would not otherwise have in the foreseeable future;
4) it provides the datacenter operations team with greater flexibility to control costs through sharing of resources: CPU selection, memory (of various tiers), GPGPUs and storage.