Note there is no intrinsic reason running multiple streams should be faster than one (edit: at this scale). It almost always indicates some bottleneck in the application or TCP tuning. (Very fast links can overwhelm slow hardware, and ISPs might do some traffic shaping too, but neither applies to local links.)
SSH was never really meant to be a high performance data transfer tool, and it shows. For example, it has a hardcoded maximum receive buffer of 2 MiB (separate from the TCP one), which drastically limits transfer speed over high-BDP links (even a fast local link, like the 10 Gbit/s one the author has). The encryption can also be a bottleneck. hpn-ssh [1] aims to solve this issue, but I'm not so sure about running an ssh fork on important systems.

[1]: https://github.com/rapier1/hpn-ssh
For completeness, I want to add:

The 2MiB are per SSH "channel" -- the SSH protocol multiplexes multiple independent transmission channels over TCP [1], and each one has its own window size.
rsync and `cat | ssh | cat` only use a single channel, so if their counterparty is an OpenSSH sshd server, their throughput is limited by the 2MiB window limit.
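To make that limit concrete: a single channel can never move more than roughly window / RTT, no matter how fast the link is. A back-of-the-envelope sketch (the RTT figures are illustrative):

    # single-channel ceiling ~ window / RTT
    #   2 MiB / 1 ms    -> ~2 GiB/s    (LAN: not the limiter)
    #   2 MiB / 20 ms   -> ~100 MiB/s
    #   2 MiB / 200 ms  -> ~10 MiB/s   (close to the ~8 MB/s rsync figure reported further down)
    echo $(( 2 * 1024 * 1024 * 1000 / 200 ))   # bytes/s at 200 ms RTT -> 10485760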
rclone seems to be able to use multiple SSH channels over a single connection; I believe this is what the `--sftp-concurrency` setting controls [2].
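For instance, something along these lines (flag names are from the rclone sftp docs [2]; the values are only a starting point, not a recommendation):

    rclone copy /data remote:backup \
        --sftp-concurrency 64 \
        --sftp-chunk-size 255k \
        --transfers 4 --progress

With many requests in flight per file, the per-request round trips overlap, so latency hurts much less.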
Some more discussion about the 2MiB limit and links to work for upstreaming a removal of these limits can be found in my post [3].
Looking into it just now, I found that the SSH protocol itself already supports dynamically growing per-channel window sizes with `CHANNEL_WINDOW_ADJUST`, and OpenSSH seems to generally implement that. I don't fully grasp why it doesn't just use that to extend as needed.
I also found that there's an official `no-flow-control` extension with the description
> channel behaves as if all window sizes are infinite.
>
> This extension is intended for, but not limited to, use by file transfer applications that are only going to use one channel and for which the flow control provided by SSH is an impediment, rather than a feature.
So this looks like it was designed exactly for the rsync case. But no software implements this extension!
I wrote those things down in [4].
It is frustrating to me that we're only a ~200 line patch away from "unlimited" instead of shitty SSH transfer speeds -- for >20 years!

[1]: https://datatracker.ietf.org/doc/html/rfc4254#section-5
[2]: https://rclone.org/sftp/#sftp-concurrency
[3]: https://news.ycombinator.com/item?id=40856136
[4]: https://github.com/djmdjm/openssh-portable-wip/pull/4#issuec...
In general TCP just isn't great for high performance. In the film industry we used to use a commercial product, Aspera (now owned by IBM), which emulated ftp or scp but used UDP with forward error correction (instead of TCP retransmission). You could configure it to use a specific amount of bandwidth and it would just push everything else off the network to achieve it.
I get 40 Gbit/s over a single localhost TCP stream on my 10-year-old laptop with iperf3.
So TCP itself does not seem to be the bottleneck, if 40 Gbit/s is "high" enough, which it probably is currently for most people.
I have also seen plenty of situations in which TCP is faster than UDP in datacenters.
For example, on Hetzner Cloud VMs, iperf3 gets me 7 Gbit/s over TCP but only 1.5 Gbit/s over UDP. On Hetzner dedicated servers with 10 Gbit links, I get 10 Gbit/s over TCP but only 4.5 Gbit/s over UDP. But this could also be due to my use of iperf3 or its implementation.
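For anyone reproducing these numbers, the invocations matter; in particular iperf3's UDP mode defaults to a 1 Mbit/s target rate, so it has to be raised explicitly (hostnames here are placeholders):

    # TCP, single stream
    iperf3 -c server.example -t 30
    # UDP, ask for 10 Gbit/s explicitly (the default target is only 1 Mbit/s)
    iperf3 -c server.example -u -b 10G -t 30
    # TCP again, with a larger socket buffer and 4 parallel streams for comparison
    iperf3 -c server.example -w 16M -P 4 -t 30

iperf3's UDP path is also largely single-core and syscall-bound, which may explain part of the TCP-vs-UDP gap above.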
I also suspect that TCP being a protocol whose state is inspectable by the network equipment between the endpoints makes it possible to implement higher performance, but I have not validated whether that is actually done.
Aspera's FASP [0] is very neat. One drawback is that the work TCP normally does in the network stack must instead be done on the CPU: if a packet is missing, or packets arrive out of order, the Aspera client fixes that itself rather than letting TCP handle it.
As I understand it, this is also the approach of WEKA.io [1]. Another approach is RDMA [2], used by storage systems like Vast, which pushes those ordering and resend tasks to NICs that support RDMA so that applications can read and write directly to the network instead of to system buffers.

[0]: https://en.wikipedia.org/wiki/Fast_and_Secure_Protocol
[1]: https://docs.weka.io/weka-system-overview/weka-client-and-mo...
[2]: https://en.wikipedia.org/wiki/Remote_direct_memory_access
I think a lot of file transfer issues that occur outside of the corporate intranet world involve hardware that you don't fully control on (at least) one end. In science, for example, transferring huge amounts of data over long distances is pretty common, and I've had to do this on boxes that had poor TCP buffer configurations. Being able to multiplex your streams in situations like this is invaluable, and I'd love to see more open source software that does this effectively, especially if it can punch through a firewall.
> Note there is no intrinsic reason running multiple streams should be faster than one.
The issue is the serialization of operations. There is overhead for each operation which translates into dead time between transfers.
However there are issues that can cause singular streams to underperform multiple streams in the real world once you reach a certain scale or face problems like packet loss.
Great point, and even beyond that I think (based on the paths) it was just a command line invocation, with something like NFS handling all the networking.
Uhh.. I work with this stuff daily and there are a LOT of intrinsic reasons a single stream would be slower than running multiple: MPLS ECMP hashing you onto a single path, a single loss event with a high BDP causing congestion control to kick in for that one flow, CPU IRQ affinity, and probably many more I'm not thinking of, like the inner workings of NIC offload queues.
Source: Been in big tech for roughly ten years now trying to get servers to move packets faster
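A few of those host-side factors can at least be inspected quickly (the interface name is just an example):

    ethtool -l eth0                          # how many hardware queues the NIC exposes/uses
    ethtool -k eth0                          # which offloads (TSO/GSO/GRO etc.) are enabled
    grep eth0 /proc/interrupts               # which CPUs take the NIC's interrupts
    sysctl net.ipv4.tcp_congestion_control   # current congestion control algorithm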
Ha, it sounds like the best way to learn something is to make a confident and incorrect claim :)
> MPLS ECMP hashing you over a single path
This is kinda like the traffic shaping I was talking about though, but fair enough. It's not an inherent limitation of a single stream, just a consequence of how your network is designed.
> a single loss event with a high BDP
I thought BBR mitigates this. Even if it doesn't, I'd still count that as a TCP stack issue.
At a large enough scale I'd say you are correct that multiple streams are inherently easier to optimize throughput for. But probably not on a single 1-10 Gbit/s link.
> It almost always indicates some bottleneck in the application or TCP tuning.
Yeah, this has been my experience with low-overhead streams as well.
Interestingly, I see this "open more streams to send more data" pattern all over the place in file transfer tooling.
Recent ones that come to mind have been BackBlaze's CLI (B2) and taking a peek at Amazon's SDK for S3 uploads with Wireshark. (What do they know that we don't seem to think we know?)
It seems like they're all doing this? Which is maybe odd, because when I analyse what Plex or Netflix is doing, it's not the same? They do what you're suggesting, tune the application + TCP/UDP stack. Though that could be due to their 1-to-1 streaming use case.
There is overhead somewhere and they're trying to get past it via semi-brute-force methods (in my opinion).
I wonder if there is a serialization or loss handling problem that we could be glossing over here?
That is a different problem. For S3-esque transfers you might very well be limited by the target's ability to receive more than X MB/s per connection, so starting parallel streams will make it faster.
I used B2 as the third leg for our backups and pretty much had to give rclone more connections at once, because the defaults were nowhere close to saturating bandwidth.
Memory and CPU are cheap (up to a point) so why not just copy/paste TCP streams. It neatly fits into multi-processing/threading as well.
When we were doing 100TB backups of storage servers we had a wrapper that ran multiple rsyncs over the file system; that got throughput up to about 20 Gbit/s over the LAN.
Tuning on Linux requires root and is systemwide. I don't think BBR is even available on other systems. And you need to tune the buffer sizes of both ends too. Using multiple streams is just less of a hassle for client users. It can also fool some traffic shaping tools. Internal use is a different story.
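For reference, the kind of system-wide tuning this refers to looks roughly like the following on Linux (needs root on both ends; the 64 MiB figure echoes the suggestion made elsewhere in the thread, and tcp_bbr must be available in the kernel):

    sysctl -w net.core.rmem_max=67108864
    sysctl -w net.core.wmem_max=67108864
    sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"
    sysctl -w net.ipv4.tcp_wmem="4096 16384 67108864"
    sysctl -w net.ipv4.tcp_congestion_control=bbr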
> Note there is no intrinsic reason running multiple streams should be faster than one
Inherent reasons or no, it's been my experience across multiple protocols, applications, network connections and environments, and machines on both ends, that, _in fact_, splitting data up and operating using multiple streams is significantly faster.
So, ok, it might not be because of an "inherent reason", but we still have to deal with it in real life.
> SSH was never really meant to be a high performance data transfer tool, and it shows.
A practical example can be `ssh -X` vs X11 over Wireguard. The lag is obvious with the former, but X11 windows from remote clients can be indistinguishable performance-wise from those of local ones with the latter.
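A rough sketch of the WireGuard variant, assuming the tunnel is already up, the desktop's X server listens on TCP, and access has been granted (addresses are invented for the example):

    # on the remote machine, point X clients straight at the desktop's X server over the tunnel
    DISPLAY=10.0.0.1:0 xterm &

Unlike `ssh -X`, this bypasses SSH's per-channel flow control entirely (WireGuard still encrypts, but in the kernel).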
> Note there is no intrinsic reason running multiple streams should be faster than one
If the server side scales (as cloud services do) it might end up using different end points for the parallel connections and saturate the bandwidth better. One server instance might be serving other clients as well and can't fill one particular client's pipe entirely.
If the application handles them serially, then yeah. But one can imagine the application opening files in threads, buffering them, and then finally sending it at full speed, so in that sense it is an application issue. If you truly have millions of small files, you're more likely to be bottlenecked by disk IO performance rather than application or network, though. My primary use case for ssh streams is zfs send, which is mostly bottlenecked by ssh itself.
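For context, the pipeline I mean looks roughly like this (dataset and host names are placeholders; `zfs send -c` keeps the stream compressed, and mbuffer, if available, smooths out the bursts):

    zfs send -c tank/data@snap \
        | mbuffer -q -m 1G \
        | ssh backup-host "mbuffer -q -m 1G | zfs receive -s tank/data"

Even with buffering on both ends, the single SSH channel's 2 MiB window discussed above remains the per-RTT ceiling.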
Single file overheads (opening millions of tiny files whose metadata is not in the OS cache and reading them) appears to be an intrinsic reason (intrinsic to the OS, at least).
I mean, isn't a single TCP connection's throughput limited by the latency? Which is why on high(er)-latency WAN links you generally want to open multiple connections for large file transfers.
Only simpler transfer protocols, like TFTP, are limited by latency.
The whole reason for the existence of TCP is to overcome the throughput limit imposed by latency. On a network with negligible latency there would be no need for TCP: you could just send each packet only after the previous one is acknowledged. But the higher the throughput of your network interface, the less likely it is that the latency is negligible.
However, for latency not to matter, the TCP windows must be large enough (i.e. the amount of data that is sent before an acknowledgement is received, which arrives after a delay caused by the latency).
I use Windows very rarely today, so I do not know its current status, but until the Windows XP days it was very frequent for Windows computers to have very small default TCP window sizes, which caused low throughput on high-latency networks, so on such networks they had to be reconfigured.
On high-latency networks, opening multiple connections is just a workaround for not having appropriate network settings. However, even when your own computer is configured optimally, opening multiple connections can be a workaround against various kinds of throttling implemented either by some intermediate ISP or by the destination server, though nowadays most rate limits are applied globally, to all connections from the same IP address, in order to make this workaround ineffective.
Rclone is a fantastic tool, but my favorite part of it is actually the underlying FS library. I've started baking Rclone FS into internal Go tooling and now everything transparently supports reading/writing to either local or remote storage. Really great for being able to test data analysis code locally and then running as batch jobs elsewhere.
What kind of data analysis do you run with Go and do you use an open source library for it? My experience with stats libraries in Go has been lukewarm so far.
RClone has been so useful over the years I built a fully managed service on top of it specifically for moving data between cloud storage providers: https://dataraven.io/
My goal is to smooth out some of the operational rough edges I've seen companies deal with when using the tool:
- Team workspaces with role-based access control
- Event notifications & webhooks – Alerts on transfer failure or resource changes via Slack, Teams, Discord, etc.
- Centralized log storage
- Vault integrations – Connect 1Password, Doppler, or Infisical for zero-knowledge credential handling (no more plain text files with credentials)
- 10 Gbps connected infrastructure (Pro tier) – High-throughput Linux systems for large transfers
I hope that you sponsor the rclone project given that it’s the core of your business! I couldn’t find any indication online that you do give back to the project. I hope I’m wrong.
Gifts do not confer obligation. If you give me a screwdriver and I use it to run my electrical installation service business, I don’t owe you a payment.
This idea that one must “give back” after receiving a gift freely given is simply silly.
How do you deal with how poorly rclone handles rate limits? It doesn't honor Dropbox's Retry-After header and just adds an exponential backoff that, in my migrations, has resulted in a pause of days.
I've adjusted threads and the various other controls rclone offers, but I still feel like I'm not seeing its true potential, because the second it hits a rate limit I can all but guarantee that job will have to be restarted with new settings.
I honestly haven't used it with Dropbox before; have you tried adjusting the --tpslimit 12 --tpslimit-burst 0 flags? Are you creating a dedicated API key for the transfer? Rate limits may vary between Plus/Advanced plans. forum.rclone.org is quite active; you may want to post more details there.
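For what it's worth, the shape of invocation I'd start from (remote names are placeholders; these are standard rclone global flags, and the values are only starting points to tune from):

    rclone sync dropbox:src remote:dst \
        --tpslimit 12 --tpslimit-burst 0 \
        --transfers 4 --checkers 8 \
        --retries 10 --low-level-retries 20 \
        --progress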
2. Do you have an example of what indexed backups would look like? I'm thinking of macOS Time Machine, where each backup only contains deltas from the last backup. Or am I completely off?
Interesting that nobody has mentioned Warp speed Data Transfer (WDT) [1].
From the readme:
- Warp speed Data Transfer (WDT) is an embeddable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.
- Goal: Lowest possible total transfer time - to be only hardware limited (disc or network bandwidth not latency) and as efficient as possible (low CPU/memory/resources utilization)

[1]: https://github.com/facebook/wdt
I prefer rsync because of its delta transfer, which doesn't resend files already on the destination, saving bandwidth. This, combined with rsync's ability to work over ssh, lets me sync anywhere rsync runs, including the cloud. It may not be faster than rclone, but it is easier on bandwidth.
The delta-transfer algorithm [0] is about detecting which chunks of a file differ between source and target [1], and limiting the transfer to those chunks. The savings depend on how and where they differ, and of course there are tradeoffs...
You seem to be referring to the selection of candidate files to transfer (based on several possible criteria like modification time, file size, or file contents via checksumming) [2].
Rsync is great. However, for huge filesystems (many files and directories) with relatively little change, you'll need to think about "assisting" it somewhat (by feeding it its candidates, obtained in a more efficient way, using --files-from=). For example: in a renderfarm system you would have additions of files, not really updates. Keep a list of frames that have finished rendering (in a cinematic film production this could be e.g. 10h/frame), and use it to feed rsync. Otherwise you'll be spending hours waiting for rsync to build its index (on both sides) over huge filesystems, instead of transferring relatively few big, new files.
In workloads where you have many sync candidates (files) that have a majority of differing chunks, it might be worth disabling the delta-transfer algorithm (--whole-file) and saving on the tradeoffs.

[0]: https://www.andrew.cmu.edu/course/15-749/READINGS/required/c...
[1]: https://en.wikipedia.org/wiki/Rsync#Determining_which_parts_...
[2]: https://en.wikipedia.org/wiki/Rsync#Determining_which_files_...
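Putting the two points above together, a sketch of such an "assisted" run (host and paths are invented for the example):

    # finished_frames.txt lists one path per newly rendered frame, relative to the source dir
    cd /renders/showA
    rsync -a --whole-file \
        --files-from=finished_frames.txt \
        ./ render-archive:/renders/showA/

rsync then only stats the listed files instead of walking the whole tree on both ends.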
Rclone can "sync" with a range of different ways to check if the existing files are the same. If no hashes are available (e.g. WebDAV) I think you can set it to check by timestamp (with a tolerance) and size.
Edit: oh I see, delta transfer only sends the changed parts of files?
In my case I get 8 MB/s with rsync, 80 MB/s with rclone. This is with cross-continent transfers so 200ms latency. If it was just a 2x difference I'd be fine with rsync but 10x is very hard to ignore.
The article links to a YouTube mini-review of USB enclosures from UGreen and Acasis, neither of which he loves.[1] I've been happy with the OWC 1M2 as a boot drive on a Mac Studio with Thunderbolt 5 ports.[2] I just noticed that there is an OWC 1M2 80G, based on USB4 v2.[3] I didn't know that was a thing, but I guess it's the USB cousin to Thunderbolt 5.

[1]: https://www.youtube.com/watch?v=gaV-O6NPWrI
[2]: https://eshop.macsales.com/shop/owc-express-1m2
[3]: https://eshop.macsales.com/item/OWC/US4V2EXP1M2/
My go-to for fast and easy parallelization is xargs -P.
find a-bunch-of-files | xargs -P 10 do-something-with-a-file
-P max-procs
--max-procs=max-procs
Run up to max-procs processes at a time; the default is 1.
If max-procs is 0, xargs will run as many processes as
possible at a time.
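Applied to the transfers discussed in this thread, that pattern gives you one rsync per top-level directory (host and paths are illustrative):

    find /data -mindepth 1 -maxdepth 1 -type d -print0 \
        | xargs -0 -P 8 -I{} rsync -a {} backup-host:/data/

It only balances well when the top-level directories are of similar size, which is the problem fpart (mentioned further down) addresses more carefully.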
Rclone is such an elegant piece of software, reminds me of the time where most software worked well most of the time. There's few people that wouldn't benefit from it, either as a developer or end-user.
There's gotta be a less antisocial way though. I'd say using BBR and increasing the buffer sizes to 64 MiB does the trick in most cases.
rsync's man page says "pipelining of file transfers to minimize latency costs" and https://rsync.samba.org/how-rsync-works.html says "Rsync is heavily pipelined".
If pipelining is really in rsync, there should be no "dead time between transfers".
cuz in my experience no one is doing that tbh
Depending on what you're doing it can be faster to leave your files in a solid archive that is less likely to be fragmented and get contiguous reads.
https://wintelguy.com/wanperf.pl
Related to this is the very useful workflow that allows you to create append-only (immutable) backups. This howto is not rsync.net-specific - you can follow this recipe at any standard SSH endpoint:
https://www.rsync.net/resources/notes/2025-q4-rsync.net_tech...
That hasn't been true for more than 8 years now.
Source: https://github.com/rclone/rclone/blob/9abf9d38c0b80094302281...
And the PR adding it: https://github.com/rclone/rclone/pull/2622
You can also run multiple instances of rsync, the problem seems how to efficiently divide the set of files.
It turns out, fpart does just that! Fpart is a Filesystem partitioner. It helps you sort file trees and pack them into bags (called "partitions"). It is developed in C and available under the BSD license.
It comes with an rsync wrapper, fpsync. Now I'd like to see a benchmark of that vs rclone! (via https://unix.stackexchange.com/q/189878/#688469 and https://stackoverflow.com/q/24058544/#comment93435424_255320...)
https://www.fpart.org/
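A minimal fpsync run might look like this, assuming fpart/fpsync are installed on the source side (option letters per the fpart documentation; the counts, sizes, host, and paths are illustrative):

    # 8 concurrent rsync workers, at most 2000 files or ~4 GiB per chunk
    fpsync -n 8 -f 2000 -s $((4 * 1024 * 1024 * 1024)) /data/ backup-host:/data/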
I'm currently working on the GUI if you're interested: https://github.com/rclone-ui/rclone-ui