We recently published a webinar about the benefits of CXL for reducing latency. The webinar explored the following topics:
- PCIe Data Flow
- PCIe Latency Breakdown
- CXL in a Nutshell
- CXL.io Flow
- CXL.io Performance
- CXL.mem/CXL.cache Overview
- Benefits of CXL.mem/CXL.cache
- More Latency Savings with CXL.mem/.cache
- CXL.mem/.cache Latency Contributors
- CXL Read Flow & Transfer Latency Estimation
- Other Factors That Impact Latency
- What Use Cases for CXL?
- Optimizations in PLDA CXL Controller IP
We received many questions in both Chinese & English and thought this blog post would be a good way to provide answers. If you were unable to attend the live webinar, the full recording is available here.
Doesn't it look like we are adding resources to CXL devices, such as cache agents and more, in order to decrease latency?
At a high level, CXL trades more hardware resources for less software management (as opposed to PCIe), so at the system level, we will see a decrease in latency. Keep in mind that cache agents reside on the Host side and not on the device side (asymmetric coherency as opposed to CCIX) and typically run at multi-gigahertz speed.
How about the comparison of latency between PCIe 5.0, CCIX and CXL?
The CXL.mem and .cache data link and transaction layers enable a lower latency than the equivalent PCIe layers, thanks to flit usage and other design options.
CCIX traffic is encapsulated into PCIe TLPs, so CCIX latency is higher than PCIe latency. In terms of relative latency values, we have: CXL.mem/.cache latency < PCIe latency < CCIX latency.
Is there an application for using CXL on SSDs?
Storage applications can leverage CXL.mem low latency (for persistent and non-persistent memories).
Additionally, CXL.cache can allow NVM devices to offload CPUs by allowing coherent DMA accesses to the host memory space.
Can this be used to improve the latency for device CPUs accessing the host memory buffer?
Yes, CXL.cache is designed for efficient access from accelerators to host memory (as well as the other way around).
The built-in coherency allows performant data exchange between host and device spaces.
Is there a separate certification needed for CXL devices?
The CXL Consortium is organizing CXL-specific compliance workshops.
For PHY developers, what differences are there between PCIe and CXL in terms of PHY implementation?
The CXL stack leverages the existing PCIe PHY layer, so an existing PCIe 5.0 PHY can be used transparently for PCIe or CXL. That being said, it is generally recommended to support sync header bypass, drift buffers and a SerDes architecture to achieve the lowest latency possible.
In layman’s terms, is this an enhancement for PCIe or a replacement?
PCIe and CXL are different protocols that coexist at the moment, as they meet different needs.
PCIe is widely used across verticals, while CXL may find its sweet spot in specific applications that require lower host-to-device and device-to-host latency.
Wasn't PCIe alone sufficient to provide cache-like features such as using ATS Capabilities present in PCIe?
No. CXL.cache provides coherency at the system level, between devices and host, whereas ATS is an address translation service with a built-in cache for translation tables. They are two different things with different purposes.
Does CXL have a different link training scheme than PCIe? Or does it work off the PCIe link training scheme?
The CXL training scheme is essentially the same as the PCIe one.
The only difference is the use of specific Training Sets to perform Alternate Protocol Negotiation.
How different is CXL 2.0 from CXL 1.1 for host and device?
We are not at liberty to comment on future CXL specifications. Contact the CXL Consortium for further details.
I understand that Cpl for MemWr in PCIe is avoided for performance improvement with a tradeoff.
Can you please help me understand why CXL spec writers chose to incorporate a completion for posted write and what impact this has on performance?
This statement is not true in all cases. Some write requests are associated with a response and others are not, depending on the functionality and performance requirements.
For example, D2H MemWr request on CXL.mem does not expect a response.
I understand that ~50 ns of PCIe delay was captured across some values of Maximum Payload Size (MPS) and Maximum Read Request Size (MRRS).
What would be the impact of variation in Maximum Payload Size (MPS) and Maximum Read Request Size (MRRS) on performance over CXL?
MPS and MRRS are PCIe concepts that only apply to CXL.io, they don't exist in CXL.mem/.cache transactions.
Since the objective is to use .io for control/legacy and to reach high traffic on .mem/.cache, there is no impact of MPS/MRRS on performance.
That being said, other parameters can impact the flit packing efficiency, and hence performance (e.g. support of MDH).
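To make the MPS/MRRS point concrete, here is a minimal Python sketch of how a PCIe (or CXL.io) read is split into requests and completions. It is a simplified model, not taken from the webinar: it ignores address alignment and the Read Completion Boundary, and the function names are hypothetical. The 64-byte cache line granularity used for the CXL.mem/.cache side is stated here as background.

```python
import math

CACHE_LINE_BYTES = 64  # assumed CXL.mem/.cache access granularity


def pcie_read_tlps(transfer_bytes: int, mrrs: int, mps: int) -> tuple[int, int]:
    """Return (read requests, completion TLPs) for one read transfer.

    Simplified model: each request asks for at most MRRS bytes, and each
    request is answered by completions carrying at most MPS bytes each.
    Alignment and Read Completion Boundary effects are ignored.
    """
    requests = 0
    completions = 0
    remaining = transfer_bytes
    while remaining > 0:
        chunk = min(remaining, mrrs)
        requests += 1
        completions += math.ceil(chunk / mps)
        remaining -= chunk
    return requests, completions


def cxl_cacheline_accesses(transfer_bytes: int) -> int:
    """Number of cache-line accesses; independent of MPS and MRRS."""
    return math.ceil(transfer_bytes / CACHE_LINE_BYTES)


if __name__ == "__main__":
    for mrrs, mps in [(128, 256), (512, 256), (4096, 512)]:
        print(mrrs, mps, pcie_read_tlps(4096, mrrs, mps))
    print("cache-line accesses:", cxl_cacheline_accesses(4096))
```

In this model the PCIe packet counts change with every MPS/MRRS setting, while the cache-line count for the same region stays fixed, which is why those two parameters do not affect .mem/.cache traffic.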
PCIe uses NTB. How can CXL be considered?
In CXL 1.1 there is no provision for an NTB-like mechanism that would allow cross-domain communication, though it’s possible future versions of the CXL specification may introduce such a mechanism.
Did you mention that CXL 2.0 supports switch and only supports 1 layer?
We cannot comment on future CXL specifications. Please contact the CXL consortium for information regarding future CXL specifications.
How do CXL and Gen-Z work together?
Inside the box (at the node level), CXL can be used to connect a CPU and other memory or accelerators. Outside the box (at the rack or row level, or for long haul), Gen-Z can be used to connect multiple systems together.
Will it support atomic operations?
Even though the CXL 1.1 specification does not specify atomics, the coherent nature of CXL.cache and CXL.mem means that atomics are fully supported. For instance, in Device Bias mode a device could perform any atomic operation on its cache lines before releasing ownership to the host.
Do you see Enterprise SSD directly support CXL?
Many CXL consortium members are in the storage space and could be looking at integrating CXL, though CXL.mem/CXL.cache will likely benefit from new types of storage devices such as Storage Class Memory (SCM).
Is CXL suitable for two symmetrical system connections (such as two mesh connections)
No, it is not suitable for a connection between two meshes, because it is an asymmetric protocol.
Does CXL spec only support PCIe Gen5 devices or is it also compatible with Gen4, and even Gen3?
Yes, it is also compatible: CXL can run at Gen 3 or Gen 4 speed.
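As a rough illustration of what running at a lower speed means for throughput, the sketch below computes the raw link bandwidth for each speed. It assumes the standard PCIe per-lane rates (8, 16 and 32 GT/s for Gen 3, 4 and 5) and 128b/130b encoding. It ignores all packet and flit overhead, so real throughput is lower, and the helper name is hypothetical.

```python
# Per-lane signaling rates in GT/s for the PCIe generations discussed above.
RATES_GT_S = {"Gen3": 8.0, "Gen4": 16.0, "Gen5": 32.0}

ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def raw_bandwidth_gbytes(rate_gt_s: float, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s, before protocol overhead."""
    return rate_gt_s * ENCODING_EFFICIENCY / 8 * lanes


if __name__ == "__main__":
    for gen, rate in RATES_GT_S.items():
        print(f"{gen} x16: {raw_bandwidth_gbytes(rate, 16):.2f} GB/s")
```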
Can PLDA CXL IP connect to ARM CMN 600?
The ARM CMN 600 uses the CXS.B interface. The PLDA IP will support the CXS.B interface at a later time.
With CXL, will the app level become the bottleneck of latency instead?
Yes, it is possible.
Does the PLDA CXL IP support CXL 2.0?
PLDA IP will support CXL 2.0 as soon as it is officially released.
How does CXL compare with CCIX, including latency?
CXL is an asymmetric interface, so there will be host and device or master and subordinate roles on the link. CXL is therefore more suitable for a system which has a main CPU to control all the peripherals. CCIX is a symmetric interface, with each end acting as a master to access the other end, so it is more suitable for connecting two peers. Since CCIX is PCIe based, its latency will be higher than CXL.
Is the newest spec version 1.1 or 2.0?
As of the time this blog is published, it is 1.1.
Could you show us the cycle count spent in each layer of CXL?
Accessing those numbers requires an NDA with PLDA.
What are CXL Type 1 and Type 3 devices?
A Type 1 device carries CXL.io + CXL.cache traffic only; a Type 3 device carries CXL.io + CXL.mem traffic only.
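The mapping from protocol mix to device type can be sketched as a small lookup. The Type 1 and Type 3 entries come from the answer above. The Type 2 entry (all three protocols) is added here for completeness and was not covered in this answer. The function and table names are hypothetical.

```python
# Protocol mix -> CXL device type. Keys are frozensets so lookup does not
# depend on the order in which protocols are listed.
DEVICE_TYPES = {
    frozenset({"io", "cache"}): "Type 1",
    frozenset({"io", "cache", "mem"}): "Type 2",  # not covered in the answer above
    frozenset({"io", "mem"}): "Type 3",
}


def device_type(protocols) -> str:
    """Classify a device from the set of CXL protocols it uses."""
    key = frozenset(protocols)
    if key not in DEVICE_TYPES:
        raise ValueError(f"no CXL device type for protocol mix {sorted(key)}")
    return DEVICE_TYPES[key]


if __name__ == "__main__":
    print(device_type(["io", "cache"]))  # accelerator without device memory
    print(device_type(["mem", "io"]))    # memory expander
```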
On the controller side will the CXL controller support Original PIPE and SERDES mode?