In the author's scenario there's no benefit to using NVMe/TCP: he just ends up doing a serial block copy with dd(1), so he isn't leveraging concurrent I/O. All the complex commands can be replaced by a simple netcat.
On the destination laptop:
$ nc -l -p 1234 | dd of=/dev/nvme0nX bs=1M
On the source laptop:
$ nc x.x.x.x 1234 </dev/nvme0nX
The dd on the destination is just there to buffer writes so they are faster/more efficient. Add a gzip/gunzip on the source/destination and the whole operation is a lot faster if your disk isn't full, i.e. if you have many zero blocks. This is by far my favorite way to image a PC over the network; I have done it many times. Be sure to pass "--fast" to gzip, as the compression is typically the bottleneck on GigE. Or better: replace gzip/gunzip with lz4/unlz4, which is even faster. Last time I did this was to image a brand new Windows laptop with a 1TB NVMe. It took 20 min (IIRC?) over GigE and the resulting image was 20GB, as the empty disk space compresses to practically nothing. I typically back up that lz4 image, and years later when I donate the laptop I restore it with unlz4 | dd. Super convenient.
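For reference, a minimal sketch of the lz4 variant (same hypothetical host/port as the plain netcat example above; assumes lz4/unlz4 are actually present on both rescue systems, which isn't always the case):
On the destination laptop:
$ nc -l -p 1234 | unlz4 | dd of=/dev/nvme0nX bs=1M
On the source laptop:
$ lz4 < /dev/nvme0nX | nc x.x.x.x 1234
lz4 defaults to its fastest compression level, so no extra flag is needed; gzip --fast and gunzip can be swapped in the same positions if lz4 isn't available.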
That said I didn't know about that Linux kernel module nvme-tcp. We learn new things every day :) I see that its utility is more for mounting a filesystem over a remote NVMe, rather than accessing it raw with dd.
Edit: on Linux the default pipe buffer size is 64kB, so the dd bs=X argument doesn't technically need to be larger than that. But bs=1M doesn't hurt (it buffers the 64kB reads until 1MB has been received) and it's future-proof if the pipe size is ever increased :) Some versions of netcat have options to control the input and output block size, which would remove the need for dd bs=X, but on rescue discs the netcat binary is usually a version without these options.
Note that you can increase the pipe buffer; I think the default maximum size is usually around 1MB. It's a bit tricky to do from the command line; one possible implementation is https://unix.stackexchange.com/a/328364
It's a little grimy, but if you use `pv` instead of `dd` on both ends you don't have to worry about specifying a sensible block size, and it'll give you a nice progress display too.
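A rough sketch of that pv version (host, port, and device are hypothetical, carried over from the netcat example above):
On the destination laptop:
$ nc -l -p 1234 | pv > /dev/nvme0nX
On the source laptop:
$ pv /dev/nvme0nX | nc x.x.x.x 1234
On the source, pv can determine the size of the block device, so it shows a rate and an ETA rather than just a byte count.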
About 9 years ago, I consulted for a company that had a bad internal hack - a disgruntled cofounder. Basically, a dead man's switch was left behind that copied the first 20MB of every disk out to some bucket and then zeroed it out. To recover the data we had to use TestDisk to rebuild the partition table… but before doing that we didn't want to touch the corrupted disks, so we ended up copying out about 40TB using rescue flash disks, netcat and dd (some of the servers had a physical RAID with all slots occupied, so you couldn't use any spare drive slots). Something along the lines of dd if=/dev/sdc bs=xxx | gzip | nc -l -p 8888 and the reverse on the other side. It actually worked surprisingly well. One thing of note: try different dd bs= values to match the sector size - proper sizing had a large impact on dd throughput.
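Presumably "the reverse on the other side" looked roughly like this (host and device hypothetical, bs left as the placeholder above); note that in this setup the sender listens and the receiver connects out:
$ nc x.x.x.x 8888 | gunzip | dd of=/dev/sdX bs=xxx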
This use of dd may cause corruption! You need iflag=fullblock to ensure it doesn't truncate any blocks, and (at the risk of cargo-culting) conv=sync doesn't hurt as well. I prefer to just nc -l -p 1234 > /dev/nvme0nX.
Partial reads won't corrupt the data. dd will issue further read()s until 1MB of data is buffered. The iflag=fullblock is only useful when counting or skipping bytes or doing direct I/O. See line 1647: https://github.com/coreutils/coreutils/blob/master/src/dd.c#...
According to the documentation of dd, "iflag=fullblock" is required only when dd is used with the "count=" option.
Otherwise, i.e. when dd has to read the entire input file because there is no "count=" option, "iflag=fullblock" does not have any documented effect.
From "info dd":
"If short reads occur, as could be the case when reading from a pipe for example, ‘iflag=fullblock’ ensures that ‘count=’ counts complete input blocks rather than input read operations."
I doubt it makes a difference. SSDs are an awful lot better at sequential writes than random writes, and concurrent IO would mainly speed up random access.
Besides, I don't think anyone really has a local network which is faster than their SSD. Even a 4-year-old consumer Samsung 970 Pro can sustain full-disk writes at 2,000 MByte/s, easily saturating a 10Gbit connection.
If we're looking at state-of-the-art consumer tech, the fastest you're getting is a USB4 40Gbit machine-to-machine transfer - but at that point you probably have something like the Crucial T700, which has a sequential write speed of 11,800 MByte/s.
The enterprise world probably doesn't look too different. You'd need a 100Gbit NIC to saturate even a single modern SSD, but any machine with such a NIC is more likely to have closer to half a dozen SSDs. At that point you're starting to be more worried about things like memory bandwidth instead. [0]
[0]: http://nabstreamingsummit.com/wp-content/uploads/2022/05/202...
If you search for “dd gzip” or “dd lz4” you can find several ways to do this. In general, interpose a gzip compression command between the sending dd and netcat, and a corresponding decompression command between the receiving netcat and dd.
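Concretely, that usually ends up looking something like this (host, port, and device are hypothetical, reusing the netcat example above):
On the source laptop:
$ dd if=/dev/nvme0nX bs=1M | gzip --fast | nc x.x.x.x 1234
On the destination laptop:
$ nc -l -p 1234 | gunzip | dd of=/dev/nvme0nX bs=1M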
> That said I didn't know about that Linux kernel module nvme-tcp. We learn new things every day :) I see that its utility is more for mounting a filesystem over a remote NVMe, rather than accessing it raw with dd.
Aside: I guess nvme-tcp would result in fewer writes, as you only copy files instead of writing over the whole disk?
As a sysadmin, I'd rather use NVMe/TCP or Clonezilla and do a slow write than try to go 5% faster with more moving parts and a chance of corrupting my drive in the process.
Plus, it'd be a well-deserved coffee break.
Considering I'd be going at GigE speeds at best, I'd add "oflag=direct" to bypass caching on the target. A bog-standard NVMe can write >300MB/s unhindered, so trying to cache is moot.
Lastly, parted can do partition resizing, but given the user is not a power user to begin with, it's just me nitpicking. Nice post otherwise.
NVMe/TCP or Clonezilla involve vastly more moving parts and more chances to mess up the options, compared to dd. In fact, the author's solution exposes his NVMe to unauthenticated remote write access by any number of clients(!) By comparison, the dd on the source is read-only, and the dd on the destination only accepts the first connection (yours), so no one else on the network can write to the disk.
I strongly recommend against oflag=direct, as in this specific use case it will always degrade performance. Read the O_DIRECT section in open(2). Or try it. Basically, using oflag=direct locks the buffer, so dd has to wait for the block to be written to disk by the kernel before it can start reading data again to fill the buffer with the next block, thereby reducing performance.
dd if=/dev/device | mbuffer to Ethernet to mbuffer dd of=/dev/device
(with some switches to select better block size and tell mbuffer to send/receive from a TCP socket)
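A sketch of what those switches might look like (port, device, and sizes are made up; mbuffer's -I listens on a TCP port, -O connects to one, -s sets the block size and -m the buffer memory):
On the destination:
$ mbuffer -I 1234 -s 128k -m 1G | dd of=/dev/nvme0nX bs=1M
On the source:
$ dd if=/dev/nvme0nX bs=1M | mbuffer -s 128k -m 1G -O x.x.x.x:1234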
If it's on a system with a fast enough processor I can save considerable time by compressing the stream over the network connection. This is particularly true when sending a relatively fresh installation where lots of the space on the source is zeroes.
> The NVM Express consortium ratified NVMe/TCP as a binding transport layer in November 2018. The standard evolved from a code base originally submitted to NVM Express by Lightbits' engineering team.
This is actually much better, because nbdcopy can handle sparse files, you can set the number of connections and threads to the number of cores, you can force a flush before exit, and you can enable a progress bar. For unencrypted drives it also supports TLS.
Previously I cloned, but this time I wanted to refresh some of the configs.
Using a USB-C cable to transfer at 10Gb/s is so useful (as my only other option was WiFi).
When you plug the computers together they form an ad-hoc network and you can just rsync across. As far as I could tell the link was saturated so using anything else (other protocols) would be pointless. Well not pointless, it's really good to learn new stuff, maybe just not when you are cloning your laptop (joke)!
I assume that those USB-C connectors were USB 4 or Thunderbolt, not USB 3.
With Thunderbolt and operating systems that support Ethernet over Thunderbolt, a virtual network adapter is automatically configured for any Thunderbolt connector, so connecting a USB C cable between 2 such computers should just work, as if they had Ethernet 10 Gb/s connectors.
With USB 3 USB C connectors, you must use USB network adapters (up to 2.5 Gb/s Ethernet).
I did this also, but had to buy a ~$30+ Thunderbolt 4 cable to get the networking to work. Just a USB3-C cable was not enough.
The transfer itself screamed and I had a terabyte over in a few mins. Also I didn't bother with encryption on this one, so that simplified things a lot.
I don't understand why he didn't pipe btrfs over the network. Do a btrfs snapshot first, then btrfs send => nc => network => nc => btrfs receive. That way only blocks in use are sent.
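A minimal sketch of that pipeline (mountpoints, snapshot name, host and port are all hypothetical; btrfs send requires a read-only snapshot, hence the -r):
On the source:
$ btrfs subvolume snapshot -r / /snap_root
$ btrfs send /snap_root | nc x.x.x.x 1234
On the destination, with the new btrfs filesystem mounted at /mnt/new:
$ nc -l -p 1234 | btrfs receive /mnt/new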
That's the first thing I thought when I saw he used btrfs. I use btrfs send/receive all the time via SSH and it works great. He could easily have set up an SSH server in the GRML live session.
There's one caveat though: with btrfs it is not possible to send snapshots recursively, so if he had lots of nested snapshots (which can happen with Docker/LXD/Incus), it is relatively hard to mirror the same structure on a new disk. I like btrfs, but recursive send/receive is one aspect where ZFS is just better.
Yea, there needs to be a `snapshot -r` option or something. I like using subvolumes to manage what gets a snapshot but sometimes you want the full disk.
I recently had to copy around 200GB of files over WiFi. I used rsync so that a connection failure wouldn't mean starting over and nothing would be lost, but it took at least 6 hours. I wonder what I could have done better.
Btw, what kinds of guarantees do you get with the dd method? Do you have to compare md5s of the resulting block level devices after?
> 200gb of files over wifi […] took at least 6 hours
> I wonder what could I have done better.
Used an Ethernet cable? That’s not an impressive throughput for a local transfer. WiFi has like a million more sources of perf bottlenecks. Btw, just using a cable on ONE leg of the device => router => device path can help a lot.
Rsync cannot transfer more than one file at a time, so if you were transferring a lot of small files that was probably the bottleneck. You could either use xargs/parallel to split the file list and run multiple instances of rsync, or use something like rclone, which supports parallel transfers on its own.
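For example, a rough sketch of the xargs approach, fanning out one rsync per top-level directory (paths, host, and job count are made up; it only helps if the top-level entries are reasonably balanced):
$ find /data -mindepth 1 -maxdepth 1 -print0 | xargs -0 -P4 -I{} rsync -a {} user@host:/backup/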
6 hours is roughly 10MB/s, so you likely could have gone much, much quicker. Did you compress with `-z`? If you could have used Ethernet you probably could have done it at closer to 100MB/s on most devices, which would have been about 35 minutes.
No, I didn't use compression. Would it be useful over a high-bandwidth connection? I presume that it wasn't wifi bandwidth that was bottlenecking, though I've not really checked.
One thing I could have done is found a way to track total progress, so that I could have noticed that this is going way too slow.
If your transport method for rsync was ssh, that is often a bottleneck, as OpenSSH has historically had some weird performance limits that needed obscure patches to get around. Enabling compression helps too, if your CPU doesn't become the bottleneck.
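The kind of thing people end up experimenting with looks roughly like this (flags illustrative rather than a recommendation; paths and host are hypothetical, and it assumes OpenSSH on both ends):
$ rsync -a -z -e "ssh -c aes128-gcm@openssh.com -o Compression=no" /src/ user@host:/dst/
Here -z compresses in rsync itself, while picking a cheap AEAD cipher and disabling ssh-level compression keeps the CPU cost of the transport down.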
WiFi shares the medium (the air) with all other radios. If it detects a collision, it stops and waits a random amount of time before transmitting again.
"Carrier-sense multiple access with collision avoidance (CSMA/CA) in computer networking, is a network multiple access method in which carrier sensing is used, but nodes attempt to avoid collisions by beginning transmission only after the channel is sensed to be "idle".[1][2] When they do transmit, nodes transmit their packet data in its entirety.
It is particularly important for wireless networks, where the alternative with collision detection CSMA/CD, is not possible due to wireless transmitters desensing (turning off) their receivers during packet transmission.
CSMA/CA is unreliable due to the hidden node problem.[3][4]
CSMA/CA is a protocol that operates in the data link layer."
https://en.wikipedia.org/wiki/Carrier-sense_multiple_access_...
I'm sure there are benefits to this approach, but I've transferred laptops before by launching an installer on both and then combining dd and nc on both ends. If I recall correctly, I also added gzip to the mix to make transferring large null sections faster.
With the author not having access to an Ethernet port on the new laptop, I think my hacky approach might've even been faster because of the slight boost compression would've provided, given that the network speed is nowhere near the point where compression would become the limiting factor, as it would on a fast link.
Well, if it's whole-disk encrypted, unless they told LUKS to pass TRIM through, you'd not be getting anything but essentially random data for the way the author described it.
I think I did something like `nc -l -p 1234 | gunzip | dd status=progress of=/dev/nvme0n1` on the receiving end and `dd if=/dev/nvme0n1 bs=40M status=progress | gzip | nc 10.1.2.3 1234` on the sending end, after plugging an Ethernet cable into both devices. In theory I could've probably also used the WiFi cards to set up a point-to-point network to speed up the transmission, but I couldn't be bothered with looking up how to make nc use mptcp like that.
Clonezilla is great. It's got one job and it usually succeeds the first time. My only complaint is that the initial learning curve requires tinkering; it's still not at fire-and-forget trust level. Experimenting is recommended, as a backup is never the same thing as a verified backup-and-restore, and even Clonezilla will have issues recreating partitions on disks that are very different from their source.
I guess most people don't have a local network faster than their SSD can transfer.
I wonder though, for those people who do, does a concurrent I/O block device replicator tool exist?
Btw, you might also want to use pv in the pipeline to see an ETA, although it might have a small impact on performance.
For example: https://unix.stackexchange.com/questions/632267
For compression the rule is that you don't do it if the CPU can't compress faster than the network.
dd is installed on every system, and if you don't have nc you can still use ssh and sacrifice a bit of performance.
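A rough sketch of the ssh fallback (user, host, and device are hypothetical; the remote dd needs write access to the target disk, so in practice a root login or sudo on the far side):
$ dd if=/dev/nvme0nX bs=1M | ssh user@x.x.x.x 'dd of=/dev/nvme0nX bs=1M'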
Where does one learn this black art?
https://www.techtarget.com/searchstorage/news/252459311/Ligh...
https://www.lightbitslabs.com/blog/linux-distributions-nvme-...
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
Serious question since last time I tried a direct non-Ethernet connection was sometime in the 90s ;)
> --adapt[=min=#,max=#]: zstd will dynamically adapt compression level to perceived I/O conditions.
dd doesn't skip empty blocks, like Clonezilla would do.
True, I always just take the NVMe disk out of the laptop and put it in a high-speed dock.