Whenever I read about such issues I always wonder why we all don’t make more use of BitTorrent. Why is it not the underlying protocol for much more stuff? Like container registries? Package repos, etc.
1. BitTorrent has a bad rep. Most people still associate it with illegal downloads.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrows because of reason 1. On very restrictive networks, they might not want to allow them at all, since it opens the door to, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and not installed on most company computers / CI pipelines (for lack of need, and again, reason 1). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scares the hell out of them.
Overall, I think it is mostly 1 and the fact that you can just `curl` stuff and have everything work.
It does sadden me that people do not understand how good a file transfer protocol BT is and how underused it is. I do remember some video game clients using BT for updates under the hood, and PeerTube uses WebTorrent, but BT is sadly not very popular.
Some of the reasons consist of lawyers sending out costly cease and desist letters even to "legitimate" users.
At a previous job, I was downloading daily legal torrent data when IT flagged me. The IT admin, eager to catch me doing something wrong, burst into the room shouting with management in tow. I had to calmly explain the situation, as management assumed all torrenting was illegal and there had been previous legal issues with an intern pirating movies. Fortunately, other colleagues backed me up.
You know what has a bad rep? Big companies that use and trade my personal information like they own it. I'll start caring about copyrights when governments force these big companies to care about my information.
WebTorrent exists. It uses WebRTC to let users connect to each other. There's support in popular trackers.
This basically handles every problem stated. There's nothing to install on computers: it's just JS running on the page. There are no firewall rules or port forwarding to set up; it's all handled by STUN/TURN in WebRTC. Users wouldn't necessarily even be aware they are uploading.
> It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrow for reason 1
Well, in many such situations data is provided for free, putting a huge burden on the other side. Even if it's a little bit less convenient, it makes the service a lot more sustainable. I imagine a torrent for the free tier and direct download as a premium option would work perfectly.
5. Most residential internet connections are woefully underprovisioned for upload, so anything that uses more of it (and yes, you need people to seed for BitTorrent to make sense) can slow down the entire connection.
6. Service providers have little control over the service level of seeders and thus the user experience. And that's before you get malicious users.
Seeding is uploading after you are done downloading.
But you are already uploading while you are still downloading, and that can't be turned off. If seeding scares someone, then uploading should scare them too. So they are right, because they are required to upload.
Amazon, Esri, Grab, Hyundai, Meta, Microsoft, Precisely, Tripadvisor and TomTom, along with tens of other businesses, got together to offer OSM data in Parquet on S3 free of charge. You can query it surgically and run analytics on it needing only MBs of bandwidth on what is a multi-TB dataset at this point. https://tech.marksblogg.com/overture-dec-2024-update.html
As someone who works with mapping data for HGV routing, I've been keeping an eye on Overture. I wonder, do you know if anyone has measured the data coverage and quality of this versus proprietary datasets like HERE Maps? Does Overture supplement OSM road attributes (such as max height restrictions) where they can find better data from other sources?
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
I remember seeing the concept of "torrents with dynamic content" a few years ago, but apparently it never became a thing [1]. I kind of wish it had, but I don't know if there are critical problems (e.g. security?).
[1]: https://www.bittorrent.org/beps/bep_0046.html
I assume it’s simply the lack of the inbuilt “universal client” that HTTP enjoys, or that devs tend to have with ssh/scp. Not that such a client (even an automated/scripted CLI client) would be so difficult to set up, but then trackers are also necessary, and then the tooling for maintaining it all. Intuitively, none of this sounds impossible, or even necessarily that difficult apart from a few tricky spots.
I think it’s more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file-syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
I used to work at a company that had to deliver huge files to every developer every week. At some point they switched from a thundering herd of rsyncs to using BitTorrent. The speed gains were massive.
It took maybe 10 seconds longer for the downloads to start, but they then ran almost as fast as the master could upload one copy.
It became disliked because of various problems and complaints, but mainly disappeared because Blizzard got the bright idea of preloading the patchset, especially for new expansions, in the weeks before. You can send down a ten-gig patch a month before release, then patch that patch a week before release, and a final small patch on the day before release, and everything is preloaded.
The great Carboniferous explosion of CDNs inspired by Netflix and friends has also greatly simplified the market, too.
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler and cheaper
If A is a copyrighted work, and B is pure noise, then C=A^B is also pure noise.
Distribute B and C. Both of these files have nothing to do with A, because they are both pure noise.
However, B^C gives you back A.
I wouldn't expect that to hold up any more than a silly idea I had (probably not original) a while back of "Pi-Storage".
The basic idea being: can you use the digits of Pi to encode data, or rather, can you find ranges of Pi that map to data you have and use that for "compression"?
A very simple example, let's take this portion of Pi:
> 3.14159265358979323846264338327950288419716939937
Then let's say we have a piece of data that, when encoded as just numbers, results in: 15926535897997169626433832
Can we encode that as: 4–15, 39–43, 21–25, 26–29 and save space? The "compression" step would take a long time (at some point you have to stop searching for overlap, as Pi goes on forever).
Anyways, a silly thought experiment that your idea reminded me of.
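As a concrete (and deliberately toy) illustration of the idea above, here is a minimal Python sketch that greedily covers the data with (offset, length) references into a known string of Pi digits; the digit string, helper names and greedy matching are just assumptions for the example:

    # Toy sketch of the "Pi-Storage" idea: greedily cover the data with
    # (offset, length) references into a known string of Pi digits.
    # PI_DIGITS is just the prefix quoted above; a real attempt would need far more.
    PI_DIGITS = "314159265358979323846264338327950288419716939937"

    def pi_compress(data, min_run=4):
        refs, i = [], 0
        while i < len(data):
            best = None
            # longest chunk starting at i that occurs somewhere in PI_DIGITS
            for j in range(len(data), i + min_run - 1, -1):
                pos = PI_DIGITS.find(data[i:j])
                if pos != -1:
                    best = (pos, j - i)
                    break
            if best is None:
                refs.append(data[i])  # no usable match: keep the digit literally
                i += 1
            else:
                refs.append(best)
                i += best[1]
        return refs

    def pi_decompress(refs):
        return "".join(r if isinstance(r, str) else PI_DIGITS[r[0]:r[0] + r[1]] for r in refs)

    refs = pi_compress("1592653589790288419")   # -> [(3, 12), (32, 7)]
    assert pi_decompress(refs) == "1592653589790288419"

The catch, of course, is that each (offset, length) pair generally needs at least as many bits as the digits it replaces, which is why this stays a thought experiment.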
Is C really "pure noise" if you can get A back out of it?
It's like an encoding format or primitive encryption, where A is merely transformed into unrecognizable data, meaningful noise, which still retains the entirety of the information.
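For what it's worth, the split-and-recombine step described above is tiny; a minimal sketch in Python (the names are made up for the example):

    import os

    # B is random noise the same length as A; C = A xor B also looks like noise,
    # and B xor C gives A back.
    def split(a: bytes):
        b = os.urandom(len(a))
        c = bytes(x ^ y for x, y in zip(a, b))
        return b, c

    def recombine(b: bytes, c: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(b, c))

    a = b"a copyrighted work"
    b, c = split(a)
    assert recombine(b, c) == a

Each half on its own is statistically indistinguishable from random (it's effectively a one-time pad), but the pair together is exactly A.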
What is the point of this? If you think you can mount an adequate defense based on xor in a court of law, then you are sorely mistaken. Any state attorney will say infringement with an additional step of obfuscation is still infringement, and any judge will follow that assessment.
To use bittorrent, your machine has to listen, and otherwise be somehow reachable. In many cases, it's not a given, and sometimes not even desirable. It sticks out.
I think a variant of bittorrent which may be successful in corporate and generally non-geek environments should have the following qualities:
- Work via WebSockets.
- Run in browser, no installation.
- Have no rebel reputation.
It's so obvious that it must have been implemented, likely multiple times. It would not be well-known because the very purpose of such an implementation would be to not draw attention.
WebTorrent is ubiquitous by now and also supported by Brave and many torrent clients. There is still much room to build, though. Get in, the water is warm!
https://en.wikipedia.org/wiki/WebTorrent#Adoption
https://rwmj.wordpress.com/2013/09/09/half-baked-idea-conten...
But yes I do think Bittorrent would be ideal for OSM here.
https://en.wikipedia.org/wiki/InterPlanetary_File_System
In my experience the official software was very buggy and unreliable. Which isn't great for something about making immutable data live forever. I had bugs with silent data truncation, GC deleting live paths and the server itself just locking up and not providing anything it had to the network.
The developers always seemed focused on making new versions of the protocols with very minor changes (no more protocol buffers, move everything to CBOR) rather than actually adding new features like encryption support or making it more suitable for hosting static sites (which seems to have been one of its main niches).
It also would have been a great tool for package repositories and other open source software archives. Large distros tend to have extensive mirror lists, but you need to configure them and find out which ones have good performance for you, and you can still only download from one mirror at a time. Decentralizing that would be very cool. Even if the average system doesn't seed any of the content, the fact that anyone can just mirror the repo and downloads automatically start pulling from them is very nice. It also makes the download resilient to any official mirror going down or changing URL. The fact that there is strong content verification built in is also great. Typically software mirrors need additional levels of verification (like PGP signatures) to avoid trusting the mirror.
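That kind of content verification is easy to approximate on top of plain mirrors too; a minimal Python sketch that checks a file fetched from an untrusted mirror against a digest you already trust (the function name is just for illustration):

    import hashlib

    # If you already know the expected SHA-256, it doesn't matter which mirror
    # the bytes actually came from.
    def verify(path: str, expected_sha256: str) -> bool:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == expected_sha256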
I really like the idea, and the protocol is pretty good overall. But the implementation and evolution really didn't work well in my opinion. I tried using it for a long time, offering many of my sites over it and mirroring various data. But eventually I gave up.
And, maybe controversially, it provided no capabilities for network separation or statistics tracking. This isn't critical for success, but one entrypoint to this market is private file-sharing sites. Having the option to use these things could give it a foot in the door and get a lot more people interested in development.
Hopefully the next similar protocol will come at some point, maybe it will catch on where IPFS didn't.
I used IPFS several years ago to get some rather large files from a friend, who had recently been interested in IPFS. From what I recall it took a full week or so to start actually transferring the files. It was so slow and finicky to connect. Bittorrent is dramatically easier to use, faster, and more reliable. It was hard to take IPFS seriously after that. I also recall an IRC bot that was supposed to post links to memes at IPFS links and they were all dead, even though it's supposed to be more resilient. I don't have the backstory on that one to know how/why the links didn't work.
I have had the same thoughts for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally and we could just share it.
From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available (potentially causing congestion of transit ports affecting everyone) and no reliable way of forecasting where the traffic will come from making capacity planning a nightmare.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting knows, the excessive number of connections means the overall quality of non-torrent traffic on networks goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a squid cache box before, but https makes that very difficult as you would have to install mitm certs on all devices.
For container images, yes, you have pull-through registries etc., but not only are these non-trivial to set up (as a service and for each client), the cloud providers charge quite a lot for storage, making it difficult to justify when not having a cache "works just fine".
The Linux distros (and CPAN and texlive etc.) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but it is not really sustainable for the wide variety of content that would be cached outside of video media or packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
> From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available...
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
This is a feature, not a bug. A torrent file/magnet link contains a hash of the data, which is immutable. Just publish a new link (you should anyway, even with HTTP).
It's not all about how you distribute content. We must also decide which content to distribute, and that is a hard problem.
The most successful strategy so far has been moderation. Moderation requires hierarchical authority: a moderator who arbitrarily determines which data is and is not allowed to flow. Even bittorrent traffic is moderated in most cases.
For data to flow over bittorrent, two things must happen:
1. There must be one or more seeders ready to connect when the leecher starts their download.
2. There must be a way for a prospective leecher to find the torrent.
The best way to meet both of these needs is with a popular tracker. So here are the pertinent questions:
1. Is your content fit for a popular tracker? Will it get buried behind all the Disney movies and porn? Does it even belong to an explicit category?
If not, then you are probably going to end up running your own tracker. Does that just mean hosting a CDN with extra steps? Cloud storage is quite cheap, and the corporate consolidation of the internet by Cloudflare, Amazon, etc. has resulted in a network infrastructure that is optimized for that kind of traffic, not for bittorrent.
2. Is a popular tracker a good fit for your content? Will your prospective downloaders even think to look there? Will they be offended by the other content on that tracker, and leave?
Again, a no will lead to you making your own tracker. Even in the simplest case, will users even bother to click your magnet link, or will they just use the regular CDN download that they are used to?
So what about package repos? Personally, I think this would be a great fit, particularly for Nix, but it's important to be explicit about participation. Seeding is a bad default for many reasons, which means you still need a relatively reliable CDN/seed anyway.
---
The internet has grown into an incredibly hierarchical network, with incredibly powerful and authoritative participants. I would love to see a revolution in decentralized computing. All of the technical needs are met, but the sociopolitical needs need serious attention. Every attempt at decentralized content distribution I have seen has met the same fate: drowned in offensive and shallow content by those who are most immediately excited to be liberated from authority. Even if it technically works, it just smells too foul to use.
I propose a new strategy to replace moderation: curation. Instead of relying on authority to block out undesirable content, we should use attested curation to filter in desirable content.
Want to give people the option to browse an internet without porn? Clearly and publicly attest which content is porn. Don't light the shit on fire, just open the windows and let it air out.
People like Geofabrik are why we can (sometimes) have nice things, and I'm very thankful for them.
The level of irresponsibility/cluelessness you can see from developers if you're hosting any kind of API is astonishing, so the downloads are not surprising at all... If someone had told me, a couple of years back, the things that I've now seen, I'd absolutely have dismissed them as making stuff up and grossly exaggerating...
However, by the same token, it's sometimes really surprising how rarely API developers think in terms of multiples of things - it's very often just endpoints that act on single entities, even when the nature of the use case is almost never at that level - so you have no way other than to send 700 requests to do "one action".
> The level of irresponsibility/cluelessness you can see from developers if you're hosting any kind of API is astonishing
This applies to anyone unskilled in a profession. I can assure you, we're not all out here hammering the shit out of any API we find.
With the accessibility of programming to just about anybody, and particularly now with "vibe-coding", it's going to happen.
Slap a 429 (Too Many Requests) in your response or something similar using a leaky-bucket algo and the junior dev/apprentice/vibe coder will soon learn what they're doing wrong.
- A senior backend dev
> Slap a 429 [...] will soon learn what they're doing wrong.
Oh how I wish this was true. We have customers sending 10s-100s of requests per second and they will complain if even just one gets a 429. As in, they escalate to their enterprise account rep. I always tell them to buy the customer an "error handling for dummies" book but they never learn.
Thanks for the reply - I did not mean to rant, but, unfortunately, this is in the context of a B2B service, and the other side is most commonly the IT teams of customers.
There are, of course, both very capable and professional people, and also kind people who are keen to react / learn, but we've also had situations where 429s result in complaints to their management about how our API "doesn't work" or "is unreliable", and then demanding refunds / threatening legal action etc...
One example was sending 1.3M update requests a day to manage the state of ~60 entities that have a total of 3 possible relevant state transitions - a humble expectation would be several requests per day to update batches of entities.
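For reference, the leaky-bucket idea suggested upthread is only a few lines; a minimal in-memory sketch in Python with made-up limits (a real deployment would key on an API token and keep the state in something shared like Redis):

    import time
    from collections import defaultdict

    RATE, BURST = 10.0, 20.0   # hypothetical: sustained 10 req/s, bursts up to 20
    _buckets = defaultdict(lambda: {"level": 0.0, "ts": time.monotonic()})

    def allow(client_id: str) -> bool:
        b, now = _buckets[client_id], time.monotonic()
        b["level"] = max(0.0, b["level"] - (now - b["ts"]) * RATE)  # bucket leaks over time
        b["ts"] = now
        if b["level"] + 1.0 > BURST:
            return False   # caller responds with 429 Too Many Requests
        b["level"] += 1.0
        return True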
I don't understand why features like S3's "requester pays" aren't more widely used (and available outside AWS). Let the inefficient consumer bear their own cost.
Major downside is that this would exclude people without access to payment networks, but maybe you could still have a rate-limited free option.
Their download service does not require authentication, and they are kind enough to be hesitant about blocking IPs (one IP could be half of a university campus, for example). So that leaves chasing around to find an individual culprit and hoping they'll be nice and fix it.
Sounds like some people are downloading it in their CI pipelines. Probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a url they can `curl` and then either block or charge users who download too much.
For example GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
That way someone manually downloading the file is not impacted, but if you try to put the url in a script it won't work.
I'd consider CI one of the worst massive wastes of computing resources invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset every single CI run.
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how when you build the app, you want to download the Italian region data from geofabrik then process it to extract what you want into your app. You script it, you put the script in your CI...and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
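One cheap fix on the consumer side is to make the download step a no-op when a recent copy is already present; a rough Python sketch, where the URL and the 24-hour freshness window are just examples:

    import os, time, urllib.request

    URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"  # example
    DEST = "italy-latest.osm.pbf"
    MAX_AGE = 24 * 3600  # treat a copy younger than a day as fresh

    def fetch_if_stale():
        if os.path.exists(DEST) and time.time() - os.path.getmtime(DEST) < MAX_AGE:
            return  # cache hit: no request made at all
        urllib.request.urlretrieve(URL, DEST)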
Whenever people complain about the energy usage of LLM training runs I wonder how this stacks up against the energy we waste by pointlessly redownloading/recompiling things (even large things) all the time in CI runs.
Optimising CI pipelines has been a strong aspect of my career so far.
Anybody can build a pipeline to get a task done (thousands of quick & shallow howto blog posts) but doing this efficiently so it becomes a flywheel rather than a blocker for teams is the hard part.
Not just caching but optimising job execution order and downstream dependencies too.
The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
I quite enjoy the work and am always learning new techniques to squeeze out extra performance or save time.
Some years ago I thought, no one would be stupid enough to download 100+ megabytes in their build script (which runs on CI whenever you push a commit).
Then I learned about Docker.
It does if you are building on the same host with preserved state and didn't clean it.
There are lots of cases where people end up with an empty docker repo at every CI run, or regularly empty the repo because docker doesn't have any sort of intelligent space management (like LRU).
To get fine-grained caching you need to use cache-mounts, not just cache layers. But the cache export doesn't include cache mounts, therefore the docker github action doesn't export cache mounts to the CI cache.
https://github.com/moby/buildkit/issues/1512
"There have been individual clients downloading the exact same 20-GB file 100s of times per day, for several days in a row. (Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!) Others download every single file we have on the server, every day."
This sounds like a problem rate-limiting would easily solve. What am I missing? The page claims almost 10,000 copies of the same file were downloaded by the same user.
The server operator is able to count the number of downloads in a 24h period for an individual user but cannot or will not set a rate limit.
Why not?
Will the users mentioned above (a) read the operator's message on this web page and then (b) change their behaviour?
I would bet against (a) and therefore (b) as well.
Geofabrik guy here. You are right - rate limiting is the way to go. It is not trivial though. We use an array of Squid proxies to serve stuff and Squid's built-in rate limiting only does IPv4. While most over-use comes from IPv4 clients it somehow feels stupid to do rate limiting on IPv4 and leave IPv6 wide open. What's more, such rate-limiting would always just be per-server which, again, somehow feels wrong when what one would want to have is limiting the sum of traffic for one client across all proxies... then again, maybe we'll go for the stupid IPv4-per-server-limit only since we're not up against some clever form of attack here but just against carelessness.
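Not a Squid feature, just an illustration of the keying half of the problem: a small Python sketch that keys IPv4 clients individually but collapses IPv6 clients to their /64, so one limiter can cover both families without leaving IPv6 wide open:

    import ipaddress

    # IPv4 clients are keyed individually; IPv6 clients are collapsed to their /64,
    # since a single connection usually gets a whole /64 (or more) to rotate within.
    def client_key(addr: str) -> str:
        ip = ipaddress.ip_address(addr)
        if isinstance(ip, ipaddress.IPv6Address):
            return str(ipaddress.ip_network(f"{ip}/64", strict=False))
        return str(ip)

    assert client_key("2001:db8::1") == client_key("2001:db8::dead:beef")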
Oh hey, it's me, the dude downloading italy-latest every 8 seconds!
Maybe not, but I can't help but wonder if anybody on my team (I work for an Italian startup that leverages GeoFabrik quite a bit) might have been a bit too trigger happy with some containerisation experiments. I think we got banned from geofabrik a while ago, and to this day I have no clue what caused the ban; I'd love to be able to understand what it was in order to avoid it in the future.
I've tried calling and e-mailing the contacts listed on geofabrik.de, to no avail. If anybody knows of another way to talk to them and get the ban sorted out, plus ideally discover what it was from us that triggered it, please let me know.
Hey there dude downloading italy-latest every 8 seconds, nice to hear from you. I don't think I saw an email from you at info@geofabrik, could you re-try?
Edit: I sent you the email, it got bounced again with a 550 Administrative Prohibition. Will try my university's account as well.
Edit2: this one seems to have gone through, please let me know if you can't see it.
Do they email heavy users? We used the free Nominatim API for geocoding addresses in 2012 and our email was a required parameter. They mailed us and asked us to cache results to reduce request rates.