This wasn't mentioned in the article, but CXL stands for Compute Express Link, an open-standard CPU-to-device interconnect; I believe they're on v3.0 now. First I had heard of it as well. Bandwidth ranges from 3.9GB/s on the low end up to 121.0GB/s with the latest x16 3.0 spec over a serial connection (x16 at PCIe 6.0's 64 GT/s signaling is 128GB/s raw per direction, so roughly 121GB/s after protocol overhead).
I'm surprised it isn't better known on HN. From what I've read over the years, this is a standard developed to enable flexible server configurations in the data center. As far as I understand it, the universal interconnect allows adding and removing devices without special allowances for each device type.
It might work for laptops, but it is not the main goal and it might be only Framework who will do something in that direction.
These days the bulk of us doing dev work don't get to play in the actual DC; it's all so abstracted away. Frankly, I'm looking to get a used rack or two at home here, just so I can play with the HW a bit.
(I do embedded and not server stuff mostly these days anyways, but generally interested in systems stuff, so)
CXL is useful for many more things beyond just memory expansion (which is a "type 3 device"; type 1 is a caching device like a SmartNIC, and type 2 is an accelerator with its own attached memory).
The purpose of CXL is to provide memory coherency between hosts and CXL devices. To quote the spec on type 2 devices:
> CXL Type 2 devices, in addition to fully coherent cache, also have memory, for example DDR, High-Bandwidth Memory (HBM), etc., attached to the device. These devices execute against memory, but their performance comes from having massive bandwidth between the accelerator and device-attached memory. The main goal for CXL is to provide a means for the Host to push operands into device-attached memory and for the Host to pull results out of device-attached memory such that it does not add software and hardware cost that offsets the benefit of the accelerator.
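To make that concrete, here's a minimal sketch (mine, not from the spec) of what the push/pull flow can look like from the host side, assuming the device-attached memory is exposed through Linux device-DAX. The /dev/dax0.0 path and the buffer layout are placeholder assumptions; the real details depend on your kernel and device driver:

    /* Host-side "push operands / pull results" sketch over a
     * device-DAX mapping of the accelerator's attached memory. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_SIZE (2UL * 1024 * 1024) /* one 2MiB dax alignment */

    int main(void) {
        int fd = open("/dev/dax0.0", O_RDWR); /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        /* Load/store access: no read()/write(), no page cache. Since the
         * link is cache coherent, plain stores land in device memory
         * without explicit DMA setup. */
        uint8_t *mem = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        /* "Push operands": write input where the accelerator expects it. */
        memcpy(mem, "input operands", 15);

        /* ...device-specific doorbell/kick would go here... */

        /* "Pull results": read the output back with ordinary loads. */
        printf("first result byte: 0x%02x\n", mem[4096]);

        munmap(mem, REGION_SIZE);
        close(fd);
        return 0;
    }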
There are some cool things you can do with CXL, like resurrecting the whole persistent memory idea with low-latency flash, making hardware offload devices more capable since you now get free cache coherency, and a whole bunch of other stuff.
But yes, it's really not for consumer use-cases. The applications I've seen colleagues work on are mostly enterprise stuff like cool RDMA integrations, cache-coherent flash, and more I can't talk about here.
I've got to imagine that at some point in the future, once it's been common on server stuff for a while, it'll end up as part of a next-gen "thunderbolt"-style connection, simply because by then the silicon will be a commodity and it'll be cheaper to use the same thing everywhere. I'm imagining a docking station that doubles the RAM and adds a powerful GPU/NPU/TPU for a workstation. Since the RAM would be close to the accelerator over CXL, they could talk directly and make full-bandwidth use of it, even if the laptop only uses the remote RAM as a "very very very fast swap" or some other lower tier. I'm not so sure it'd be useful for persistent storage at that level, but it'd make for some really cool options for a portable workstation, since you wouldn't need everything inside the device.
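For the "very fast swap" tier, you arguably wouldn't even need new APIs: Linux already surfaces CXL type 3 memory as a CPU-less NUMA node, so placement is just NUMA policy. A hedged sketch with libnuma (the node number 2 is an assumption; check numactl -H on the real box, and build with -lnuma):

    /* Treat a CXL memory expander as "far RAM" via NUMA placement. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        int cxl_node = 2;          /* hypothetical expander node */
        size_t sz = 1UL << 30;     /* 1 GiB */

        /* Cold/overflow data goes to the far (CXL) node... */
        char *cold = numa_alloc_onnode(sz, cxl_node);
        /* ...hot data stays on the local node. */
        char *hot = numa_alloc_local(sz);
        if (!cold || !hot) { fprintf(stderr, "alloc failed\n"); return 1; }

        memset(cold, 0, sz);  /* touch pages so they're actually placed */
        memset(hot, 0, sz);

        numa_free(cold, sz);
        numa_free(hot, sz);
        return 0;
    }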
CXL 3.0 is where it gets interesting: that's where you start to get switched fabrics, where many hosts can talk to many devices. Having that pool of RAM & GPUs be usable on demand by whoever needs it has some attractive possibilities. Also, one can just imagine having pools of data in CXL memory that a host can just attach to & read, which seems like a cool possibility.
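A purely speculative sketch of that attach-and-read idea: assume a fabric manager has already mapped a shared region into this host and exposed it as a (hypothetical) /dev/dax1.0, and a writer on another host publishes records behind a version counter. The struct layout is invented for illustration:

    /* Reader side of a cross-host shared CXL memory pool. */
    #include <fcntl.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct shared_hdr {
        _Atomic uint64_t version;  /* bumped by the publishing host */
        uint64_t payload;
    };

    int main(void) {
        int fd = open("/dev/dax1.0", O_RDONLY); /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        struct shared_hdr *hdr = mmap(NULL, 1UL << 21, PROT_READ,
                                      MAP_SHARED, fd, 0);
        if (hdr == MAP_FAILED) { perror("mmap"); return 1; }

        /* Acquire-load the version, then read the data: hardware
         * coherency across the fabric means no network hop, just a
         * (far) memory load. */
        uint64_t v = atomic_load_explicit(&hdr->version,
                                          memory_order_acquire);
        printf("version %llu, payload %llu\n",
               (unsigned long long)v, (unsigned long long)hdr->payload);

        munmap(hdr, 1UL << 21);
        close(fd);
        return 0;
    }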
It does kind of upset me a bit that CXL 3.0 still seems purely host-to-switch-to-device oriented. If you have your formerly-PCIe slots on your cores speaking CXL and doing directory-based memory coherency over a fabric, I'd really, really love to be able to talk to other hosts. Maybe that happens & is possible in 3.0, but it feels like CXL isn't paving that cowpath, isn't making it obvious, and that there will be a bunch of nasty proprietary ways to bridge computers & chat over CXL that are all non-standard. I wish CXL had been more direct about making itself & its upcoming switched fabric viable & interesting for host-to-host.
Yeah optane seemed like it had so much potential -- with the right abstractions, it neatly dealt with the eternal problem of the separation between RAM and disk. Much of DB and storage system engineering comes down to decisions around when to persist what, and optane was very promising in allowing for new architectures that had much better performance at lower complexity.
Alas, optane's dead now. I do know people actively working on resurrecting a lot of pmem work on low-latency flash, however, and it seems like this is one area with a lot of momentum behind it.
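For flavor, the pmem programming model boils down to "the store is the write path, the flush is the commit decision". A minimal sketch of my own, using a plain mmap'd file so it runs anywhere; on real persistent memory you'd map with MAP_SYNC and flush cache lines (e.g. PMDK's pmem_persist) instead of calling msync:

    /* "When to persist what" collapses to a store plus one flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LEN 4096

    int main(void) {
        int fd = open("records.db", O_CREAT | O_RDWR, 0644);
        if (fd < 0 || ftruncate(fd, LEN) < 0) { perror("setup"); return 1; }

        char *db = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (db == MAP_FAILED) { perror("mmap"); return 1; }

        /* No write() path, no buffer pool: updating the "database" is
         * a store, and durability is one flush of the dirtied range. */
        strcpy(db, "balance=42");
        if (msync(db, LEN, MS_SYNC) < 0) { perror("msync"); return 1; }

        munmap(db, LEN);
        close(fd);
        return 0;
    }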
Rack-level disaggregated compute/memory/storage/accelerator architectures allow hardware to be dynamically partitioned and aggregated to suit concurrent workloads as they evolve over time, and make it easier to achieve cost efficiency as well as incremental, continuous, non-disruptive hardware upgrades.
It's been a long day; I spent a few seconds wondering what this interesting new "best CPUs" technical architecture was before realizing the article is SEO'd blogspam.
But with Genoa, for example, socket-to-socket latency has climbed to 220ns, and going across nodes within a socket is 110ns. I feel like CXL will be less than a 2x hit, if only because cores themselves are having higher and higher latencies. https://chipsandcheese.com/2023/07/17/genoa-x-server-v-cache...
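Numbers like those come from dependent-load (pointer-chasing) benchmarks. A rough sketch of the technique below; run it pinned to different nodes (e.g. numactl --membind=N) to compare local DRAM vs. the remote socket vs. a CXL expander:

    /* Each load depends on the previous one, so total time divided by
     * iterations approximates raw memory latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024 / sizeof(void *)) /* 64 MiB, beats L3 */
    #define ITERS (50 * 1000 * 1000UL)

    int main(void) {
        void **buf = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        if (!buf || !idx) return 1;

        /* Build a random single cycle so the prefetcher can't help. */
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)
            buf[idx[i]] = &buf[idx[(i + 1) % N]];

        struct timespec t0, t1;
        void **p = (void **)buf[0];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
            p = (void **)*p;             /* serialized dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns/load (%p)\n", ns / ITERS, (void *)p);
        free(idx); free(buf);
        return 0;
    }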
https://www.computeexpresslink.org/about-cxl
https://en.wikipedia.org/wiki/Compute_Express_Link
Persistent-memory databases are such a neat idea; it's a shame the hardware for them isn't commonplace.
Wasn't Optane (the hardware) trying to use CXL?
Maybe switch to the source article? https://www.servethehome.com/fadu-cxl-2-0-switch-and-pcie-ge...
What are you going to do with the other 999,999,360KB?
Anything to do with data processing, machine learning, LLMs: hundreds of gigabytes of RAM can be incredibly nice.
Especially as pandas recommends RAM equal to 5-10x the size of the dataset, so a 100-200GB dataset can already call for a terabyte.
When you're an individual, not having to think about managing a cluster...
Edit: One of the lucky 10,000 (the famous Bill Gates "640K" joke). Got it. I walk away cultured.
OK? What laptops even support it? How much capacity is there ("over 1TB" is kind of vague)? So many questions...
[0] Emulating CXL Shared Memory Devices in QEMU https://memverge.com/cxl-qemuemulating-cxl-shared-memory-dev...
[1] CXL support in QEMU https://www.qemu.org/docs/master/system/devices/cxl.html
I have hopes. I can't help it. Cool shit is inspiring. I, however, am not holding my breath.