Whenever I read about such issues I always wonder why we all don’t make more use of BitTorrent. Why is it not the underlying protocol for much more stuff? Like container registries? Package repos, etc.
1. BitTorrent has a bad rep. Most people still associate it with illegal downloads.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrows because of reason 1. On very restrictive networks, they might not want to allow them at all, since it opens the door to, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and not installed on most company computers / CI pipelines (for lack of need, and again, reason 1). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scares the hell out of them.
Overall, I think it is mostly 1 and the fact that you can just `curl` stuff and have everything work.
It does sadden me that people do not understand how good a file transfer protocol BT is and how underused it is. I do remember some video game clients using BT for updates under the hood, and PeerTube uses WebTorrent, but BT is sadly not very popular.
Some of the reasons consist of lawyers sending out costly cease and desist letters even to "legitimate" users.
At a previous job, I was downloading daily legal torrent data when IT flagged me. The IT admin, eager to catch me doing something wrong, burst into the room shouting with management in tow. I had to calmly explain the situation, as management assumed all torrenting was illegal and there had been previous legal issues with an intern pirating movies. Fortunately, other colleagues backed me up.
You know what has a bad rep? Big companies that use and trade my personal information like they own it. I'll start caring about copyrights when governments force these big companies to care about my information.
WebTorrent exists. It uses WebRTC to let users connect to each other. There's support in popular trackers.
This basically handles every problem stated. There's nothing to install on computers: it's just JS running on the page. There are no firewall rules or port forwarding to set up; it's all handled by STUN/TURN in WebRTC. Users wouldn't necessarily even be aware they are uploading.
> It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrow for reason 1
Well, in many such situations data is provided for free, putting a huge burden on the other side. Even if it's a little bit less convenient, it makes the service a lot more sustainable. I imagine a torrent for the free tier and direct download as a premium option would work perfectly.
5. Most residential internet connections are woefully underprovisioned for upload, so anything that uses more of it (and yes, you need people to seed for BitTorrent to make sense) can slow down the entire connection.
6. Service providers have little control over the service level of seeders and thus the user experience. And that's before you get malicious users.
Seeding is uploading after you are done downloading.
But you are already uploading while you are still downloading, and that can't be turned off. If seeding scares someone, then uploading should scare them too. So they are right, because they are required to upload.
Amazon, Esri, Grab, Hyundai, Meta, Microsoft, Precisely, Tripadvisor and TomTom, along with tens of other businesses, got together to offer OSM data in Parquet on S3 free of charge. You can query it surgically and run analytics on it needing only MBs of bandwidth on what is a multi-TB dataset at this point. https://tech.marksblogg.com/overture-dec-2024-update.html
As someone who works with mapping data for HGV routing, I've been keeping an eye on Overture. I wonder, do you know if anyone has measured the data coverage and quality of this versus proprietary datasets like HERE Maps? Does Overture supplement OSM road attributes (such as max height restrictions) where they can find better data from other sources?
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
I remember seeing the concept of "torrents with dynamic content" a few years ago, but apparently it never became a thing [1]. I kind of wish it had, but I don't know if there are critical problems (e.g. security?).
[1]: https://www.bittorrent.org/beps/bep_0046.html
I assume it’s simply the lack of the inbuilt “universal client” that HTTP enjoys, or that devs tend to have with ssh/scp. Not that such a client (even an automated/scripted CLI client) would be so difficult to set up, but then trackers are also necessary, and then the tooling for maintaining it all. Intuitively, none of this sounds impossible, or even necessarily that difficult apart from a few tricky spots.
I think it’s more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file-syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
I used to work at a company that had to deliver huge files to every developer every week. At some point they switched from a thundering herd of rsyncs to using BitTorrent. The speed gains were massive.
It took maybe 10 seconds longer for the downloads to start, but they then ran almost as fast as the master could upload one copy.
It became disliked because of various problems and complaints, but mainly disappeared because Blizzard got the bright idea of preloading the patchset, especially for new expansions, in the weeks before. You can send down a ten-gig patch a month before release, then patch that patch a week before release, and a final small patch on the day before release, and everything is preloaded.
The great Carboniferous explosion of CDNs inspired by Netflix and friends has also greatly simplified the market, too.
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler and cheaper
If A is a copyrighted work, and B is pure noise, then C=A^B is also pure noise.
Distribute B and C. Both of these files have nothing to do with A, because they are both pure noise.
However, B^C gives you back A.
I wouldn't expect that to hold up any more than a silly idea I had (probably not original) a while back of "Pi-Storage".
The basic idea being: can you use the digits of Pi to encode data, or rather, can you find ranges of Pi that map to data you have and use that for "compression"?
A very simple example, let's take this portion of Pi:
> 3.14159265358979323846264338327950288419716939937
Then let's say we have a piece of data that, when encoded as just numbers, results in: 15926535897997169626433832
Can we encode that as: 4–15, 39–43, 21–25, 26–29 and save space? The "compression" step would take a long time (at some point you have to stop searching for overlap, as Pi goes on forever).
Anyways, a silly thought experiment that your idea reminded me of.
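As a concrete (and deliberately toy) illustration of the idea above, here is a minimal Python sketch that greedily covers the data with (offset, length) references into a known string of Pi digits; the digit string, helper names and greedy matching are just assumptions for the example:

    # Toy sketch of the "Pi-Storage" idea: greedily cover the data with
    # (offset, length) references into a known string of Pi digits.
    # PI_DIGITS is just the prefix quoted above; a real attempt would need far more.
    PI_DIGITS = "314159265358979323846264338327950288419716939937"

    def pi_compress(data, min_run=4):
        refs, i = [], 0
        while i < len(data):
            best = None
            # longest chunk starting at i that occurs somewhere in PI_DIGITS
            for j in range(len(data), i + min_run - 1, -1):
                pos = PI_DIGITS.find(data[i:j])
                if pos != -1:
                    best = (pos, j - i)
                    break
            if best is None:
                refs.append(data[i])  # no usable match: keep the digit literally
                i += 1
            else:
                refs.append(best)
                i += best[1]
        return refs

    def pi_decompress(refs):
        return "".join(r if isinstance(r, str) else PI_DIGITS[r[0]:r[0] + r[1]] for r in refs)

    refs = pi_compress("1592653589790288419")   # -> [(3, 12), (32, 7)]
    assert pi_decompress(refs) == "1592653589790288419"

The catch, of course, is that each (offset, length) pair generally needs at least as many bits as the digits it replaces, which is why this stays a thought experiment.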
Is C really "pure noise" if you can get A back out of it?
It's like an encoding format or primitive encryption, where A is merely transformed into unrecognizable data, meaningful noise, which still retains the entirety of the information.
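For what it's worth, the split-and-recombine step described above is tiny; a minimal sketch in Python (the names are made up for the example):

    import os

    # B is random noise the same length as A; C = A xor B also looks like noise,
    # and B xor C gives A back.
    def split(a: bytes):
        b = os.urandom(len(a))
        c = bytes(x ^ y for x, y in zip(a, b))
        return b, c

    def recombine(b: bytes, c: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(b, c))

    a = b"a copyrighted work"
    b, c = split(a)
    assert recombine(b, c) == a

Each half on its own is statistically indistinguishable from random (it's effectively a one-time pad), but the pair together is exactly A.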
What is the point of this? If you think you can mount an adequate defense based on xor in a court of law, then you are sorely mistaken. Any state attorney will say infringement with an additional step of obfuscation is still infringement, and any judge will follow that assessment.
To use bittorrent, your machine has to listen, and otherwise be somehow reachable. In many cases, it's not a given, and sometimes not even desirable. It sticks out.
I think a variant of bittorrent which may be successful in corporate and generally non-geek environments should have the following qualities:
- Work via WebSockets.
- Run in browser, no installation.
- Have no rebel reputation.
It's so obvious that it must have been implemented, likely multiple times. It would not be well-known because the very purpose of such an implementation would be to not draw attention.
WebTorrent is ubiquitous by now and also supported by Brave and many torrent clients. There is still much room to build, though. Get in, the water is warm!
https://en.wikipedia.org/wiki/WebTorrent#Adoption
https://rwmj.wordpress.com/2013/09/09/half-baked-idea-conten...
But yes I do think Bittorrent would be ideal for OSM here.
https://en.wikipedia.org/wiki/InterPlanetary_File_System
In my experience the official software was very buggy and unreliable. Which isn't great for something about making immutable data live forever. I had bugs with silent data truncation, GC deleting live paths and the server itself just locking up and not providing anything it had to the network.
The developers always seemed focused on making new versions of the protocols with very minor changes (no more protocol buffers, move everything to CBOR) rather than actually adding new features like encryption support or making it more suitable for hosting static sites (which seems to have been one of its main niches).
It also would have been a great tool for package repositories and other open source software archives. Large distros tend to have extensive mirror lists, but you need to configure them and find out which ones have good performance for you, and you can still only download from one mirror at a time. Decentralizing that would be very cool. Even if the average system doesn't seed any of the content, the fact that anyone can just mirror the repo and downloads automatically start pulling from them is very nice. It also makes the download resilient to any official mirror going down or changing URL. The fact that there is strong content verification built in is also great. Typically software mirrors need additional levels of verification (like PGP signatures) to avoid trusting the mirror.
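That kind of content verification is easy to approximate on top of plain mirrors too; a minimal Python sketch that checks a file fetched from an untrusted mirror against a digest you already trust (the function name is just for illustration):

    import hashlib

    # If you already know the expected SHA-256, it doesn't matter which mirror
    # the bytes actually came from.
    def verify(path: str, expected_sha256: str) -> bool:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == expected_sha256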
I really like the idea, and the protocol is pretty good overall. But the implementation and evolution really didn't work well in my opinion. I tried using it for a long time, offering many of my sites over it and mirroring various data. But eventually I gave up.
And, maybe controversially, it provided no capabilities for network separation or statistics tracking. This isn't critical for success, but one entrypoint to this market is private file-sharing sites. Having the option to use these things could give it a foot in the door and get a lot more people interested in development.
Hopefully the next similar protocol will come at some point, maybe it will catch on where IPFS didn't.
I used IPFS several years ago to get some rather large files from a friend, who had recently been interested in IPFS. From what I recall it took a full week or so to start actually transferring the files. It was so slow and finicky to connect. Bittorrent is dramatically easier to use, faster, and more reliable. It was hard to take IPFS seriously after that. I also recall an IRC bot that was supposed to post links to memes at IPFS links and they were all dead, even though it's supposed to be more resilient. I don't have the backstory on that one to know how/why the links didn't work.
I have had the same thoughts for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally and we could just share it.
From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available (potentially causing congestion of transit ports affecting everyone) and no reliable way of forecasting where the traffic will come from making capacity planning a nightmare.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting knows, the excessive number of connections means the overall quality of non-torrent traffic on networks goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a squid cache box before, but https makes that very difficult as you would have to install mitm certs on all devices.
For container images, yes, you have pull-through registries etc., but not only are these non-trivial to set up (as a service and for each client), the cloud providers charge quite a lot for storage, making it difficult to justify when not having a cache "works just fine".
The Linux distros (and CPAN and texlive etc.) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but it is not really sustainable for the wide variety of content that would be cached outside of video media or packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
> From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available...
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
This is a feature, not a bug. A torrent file/magnet link contains a hash of the data, which is immutable. Just publish a new link (you should anyway, even with HTTP).
It's not all about how you distribute content. We must also decide which content to distribute, and that is a hard problem.
The most successful strategy so far has been moderation. Moderation requires hierarchical authority: a moderator who arbitrarily determines which data is and is not allowed to flow. Even bittorrent traffic is moderated in most cases.
For data to flow over bittorrent, two things must happen:
1. There must be one or more seeders ready to connect when the leecher starts their download.
2. There must be a way for a prospective leecher to find the torrent.
The best way to meet both of these needs is with a popular tracker. So here are the pertinent questions:
1. Is your content fit for a popular tracker? Will it get buried behind all the Disney movies and porn? Does it even belong to an explicit category?
If not, then you are probably going to end up running your own tracker. Does that just mean hosting a CDN with extra steps? Cloud storage is quite cheap, and the corporate consolidation of the internet by Cloudflare, Amazon, etc. has resulted in a network infrastructure that is optimized for that kind of traffic, not for bittorrent.
2. Is a popular tracker a good fit for your content? Will your prospective downloaders even think to look there? Will they be offended by the other content on that tracker, and leave?
Again, a no will lead to you making your own tracker. Even in the simplest case, will users even bother to click your magnet link, or will they just use the regular CDN download that they are used to?
So what about package repos? Personally, I think this would be a great fit, particularly for Nix, but it's important to be explicit about participation. Seeding is a bad default for many reasons, which means you still need a relatively reliable CDN/seed anyway.
---
The internet has grown into an incredibly hierarchical network, with incredibly powerful and authoritative participants. I would love to see a revolution in decentralized computing. All of the technical needs are met, but the sociopolitical needs need serious attention. Every attempt at decentralized content distribution I have seen has met the same fate: drowned in offensive and shallow content by those who are most immediately excited to be liberated from authority. Even if it technically works, it just smells too foul to use.
I propose a new strategy to replace moderation: curation. Instead of relying on authority to block out undesirable content, we should use attested curation to filter in desirable content.
Want to give people the option to browse an internet without porn? Clearly and publicly attest which content is porn. Don't light the shit on fire, just open the windows and let it air out.
People like Geofabrik are why we can (sometimes) have nice things, and I'm very thankful for them.
The level of irresponsibility/cluelessness you can see from developers if you're hosting any kind of API is astonishing, so the downloads are not surprising at all... If someone had told me, a couple of years back, the things that I've now seen, I'd absolutely have dismissed them as making stuff up and grossly exaggerating...
However, by the same token, it's sometimes really surprising how rarely API developers think in terms of multiples of things - it's very often just endpoints that act on single entities, even when the nature of the use case is almost never at that level - so you have no way other than to send 700 requests to do "one action".
> The level of irresponsibility/cluelessness you can see from developers if you're hosting any kind of API is astonishing
This applies to anyone unskilled in a profession. I can assure you, we're not all out here hammering the shit out of any API we find.
With the accessibility of programming to just about anybody, and particularly now with "vibe-coding", it's going to happen.
Slap a 429 (Too Many Requests) in your response or something similar using a leaky-bucket algo and the junior dev/apprentice/vibe coder will soon learn what they're doing wrong.
- A senior backend dev
> Slap a 429 [...] will soon learn what they're doing wrong.
Oh how I wish this was true. We have customers sending 10s-100s of requests per second and they will complain if even just one gets a 429. As in, they escalate to their enterprise account rep. I always tell them to buy the customer an "error handling for dummies" book but they never learn.
Thanks for the reply - I did not mean to rant, but, unfortunately, this is in the context of a B2B service, and the other side is most commonly the IT teams of customers.
There are, of course, both very capable and professional people, and also kind people who are keen to react / learn, but we've also had situations where 429s result in complaints to their management about how our API "doesn't work" or "is unreliable", and then demanding refunds / threatening legal action etc...
One example was sending 1.3M update requests a day to manage the state of ~60 entities that have a total of 3 possible relevant state transitions - a humble expectation would be several requests per day to update batches of entities.
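For reference, the leaky-bucket idea suggested upthread is only a few lines; a minimal in-memory sketch in Python with made-up limits (a real deployment would key on an API token and keep the state in something shared like Redis):

    import time
    from collections import defaultdict

    RATE, BURST = 10.0, 20.0   # hypothetical: sustained 10 req/s, bursts up to 20
    _buckets = defaultdict(lambda: {"level": 0.0, "ts": time.monotonic()})

    def allow(client_id: str) -> bool:
        b, now = _buckets[client_id], time.monotonic()
        b["level"] = max(0.0, b["level"] - (now - b["ts"]) * RATE)  # bucket leaks over time
        b["ts"] = now
        if b["level"] + 1.0 > BURST:
            return False   # caller responds with 429 Too Many Requests
        b["level"] += 1.0
        return True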
I don't understand why features like S3's "requester pays" aren't more widely used (and available outside AWS). Let the inefficient consumer bear their own cost.
Major downside is that this would exclude people without access to payment networks, but maybe you could still have a rate-limited free option.
Their download service does not require authentication, and they are kind enough to be hesitant about blocking IPs (one IP could be half of a university campus, for example). So that leaves chasing around to find an individual culprit and hoping they'll be nice and fix it.
Sounds like some people are downloading it in their CI pipelines. Probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a url they can `curl` and then either block or charge users who download too much.
For example GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
That way someone manually downloading the file is not impacted, but if you try to put the url in a script it won't work.
I'd consider CI one of the worst massive wastes of computing resources invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset every single CI run.
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how when you build the app, you want to download the Italian region data from geofabrik then process it to extract what you want into your app. You script it, you put the script in your CI...and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
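One cheap fix on the consumer side is to make the download step a no-op when a recent copy is already present; a rough Python sketch, where the URL and the 24-hour freshness window are just examples:

    import os, time, urllib.request

    URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"  # example
    DEST = "italy-latest.osm.pbf"
    MAX_AGE = 24 * 3600  # treat a copy younger than a day as fresh

    def fetch_if_stale():
        if os.path.exists(DEST) and time.time() - os.path.getmtime(DEST) < MAX_AGE:
            return  # cache hit: no request made at all
        urllib.request.urlretrieve(URL, DEST)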
Whenever people complain about the energy usage of LLM training runs I wonder how this stacks up against the energy we waste by pointlessly redownloading/recompiling things (even large things) all the time in CI runs.
Optimising CI pipelines has been a strong aspect of my career so far.
Anybody can build a pipeline to get a task done (thousands of quick & shallow howto blog posts) but doing this efficiently so it becomes a flywheel rather than a blocker for teams is the hard part.
Not just caching but optimising job execution order and downstream dependencies too.
The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
I quite enjoy the work and am always learning new techniques to squeeze out extra performance or save time.
Some years ago I thought, no one would be stupid enough to download 100+ megabytes in their build script (which runs on CI whenever you push a commit).
Then I learned about Docker.
It does if you are building on the same host with preserved state and didn't clean it.
There are lots of cases where people end up with an empty docker repo at every CI run, or regularly empty the repo because docker doesn't have any sort of intelligent space management (like LRU).
To get fine-grained caching you need to use cache-mounts, not just cache layers. But the cache export doesn't include cache mounts, therefore the docker github action doesn't export cache mounts to the CI cache.
https://github.com/moby/buildkit/issues/1512
"There have been individual clients downloading the exact same 20-GB file 100s of times per day, for several days in a row. (Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!) Others download every single file we have on the server, every day."
This sounds like a problem rate-limiting would easily solve. What am I missing? The page claims almost 10,000 copies of the same file were downloaded by the same user.
The server operator is able to count the number of downloads in a 24h period for an individual user but cannot or will not set a rate limit.
Why not?
Will the users mentioned above (a) read the operator's message on this web page and then (b) change their behaviour?
I would bet against (a) and therefore (b) as well.
Geofabrik guy here. You are right - rate limiting is the way to go. It is not trivial though. We use an array of Squid proxies to serve stuff and Squid's built-in rate limiting only does IPv4. While most over-use comes from IPv4 clients it somehow feels stupid to do rate limiting on IPv4 and leave IPv6 wide open. What's more, such rate-limiting would always just be per-server which, again, somehow feels wrong when what one would want to have is limiting the sum of traffic for one client across all proxies... then again, maybe we'll go for the stupid IPv4-per-server-limit only since we're not up against some clever form of attack here but just against carelessness.
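Not a Squid feature, just an illustration of the keying half of the problem: a small Python sketch that keys IPv4 clients individually but collapses IPv6 clients to their /64, so one limiter can cover both families without leaving IPv6 wide open:

    import ipaddress

    # IPv4 clients are keyed individually; IPv6 clients are collapsed to their /64,
    # since a single connection usually gets a whole /64 (or more) to rotate within.
    def client_key(addr: str) -> str:
        ip = ipaddress.ip_address(addr)
        if isinstance(ip, ipaddress.IPv6Address):
            return str(ipaddress.ip_network(f"{ip}/64", strict=False))
        return str(ip)

    assert client_key("2001:db8::1") == client_key("2001:db8::dead:beef")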
Oh hey, it's me, the dude downloading italy-latest every 8 seconds!
Maybe not, but I can't help but wonder if anybody on my team (I work for an Italian startup that leverages GeoFabrik quite a bit) might have been a bit too trigger happy with some containerisation experiments. I think we got banned from geofabrik a while ago, and to this day I have no clue what caused the ban; I'd love to be able to understand what it was in order to avoid it in the future.
I've tried calling and e-mailing the contacts listed on geofabrik.de, to no avail. If anybody knows of another way to talk to them and get the ban sorted out, plus ideally discover what it was from us that triggered it, please let me know.
Hey there dude downloading italy-latest every 8 seconds, nice to hear from you. I don't think I saw an email from you at info@geofabrik, could you re-try?
Edit: I sent you the email, it got bounced again with a 550 Administrative Prohibition. Will try my university's account as well.
Edit2: this one seems to have gone through, please let me know if you can't see it.
Do they email heavy users? We used the free Nominatim API for geocoding addresses in 2012 and our email was a required parameter. They mailed us and asked us to cache results to reduce request rates.