Excellent! ArchiveTeam have always been impressive this way. Some years ago, I was working at a video platform that had just announced it would be shutting down fairly soon. I forget how, but one way or another I got connected with someone at ArchiveTeam who expressed interest in archiving it all before it was too late. Believing this to be a good idea, I gave them a couple of tips about where some of our device-sniffing server endpoints were likely to give them a little trouble, and temporarily "donated" a couple of EC2 instances to them to put towards their archiving tasks.
Since the servers were mine, I could see what was happening, and I was very impressed. Within I want to say two minutes, the instances had been fully provisioned and were actively archiving videos as fast as was possible, fully saturating the connection, with each instance knowing to only grab videos the other instances had not already gotten. Basically they have always struck me as not only having a solid mission, but also being ultra-efficient in how they carry it out.
Title is imprecise: it's ArchiveTeam.org, not Archive.org. The Internet Archive is providing free hosting, but the archival work was done by ArchiveTeam members.
What ArchiveTeam mainly does is provide hand-made scripts to aggressively archive specific websites that are about to die, with a prioritization for things the community deems most endangered and most important. They provide a bot you can run to grab these scripts automatically and run them on your own hardware, to join the volunteer effort.
This is in contrast to the Wayback Machine's built-in crawler, which is just a broad-spectrum internet crawler without any specific rules, prioritizations, or supplementary link lists.
For example, one ArchiveTeam project had the goal to save as many obscure Wikis as possible, using the MediaWiki export feature rather than just grabbing page contents directly. This came in handy for thousands of wikis that were affected by Miraheze's disk failure and happened to have backups created by this project. Thanks to the domain-specific technique, the backups were high-fidelity enough that many users could immediately restart their wiki on another provider as if nothing happened.
They also try to "graze the rate limit" when a website announces a shutdown date and there isn't enough time to capture everything. They actively monitor for error responses and adjust the archiving rate accordingly, to get as much as possible as fast as possible, hopefully without crashing the backend or inadvertently archiving a bunch of useless error messages.
> Like they kinda seem like an unnecessary middle-man between the archive and archivee
They are the middlemen who collect the data to be archived.
In this example, the archivee (goo.gl/Alphabet) is simply shutting the service down and has no interest in archiving it. Archive.org is willing to host the data, but only if somebody brings it to them. ArchiveTeam writes and organises crawlers to collect the data and send it to Archive.org.
(Source: ran a Warrior)
ArchiveTeam delegates tasks to volunteers (and to themselves) running the Archive Warrior VM, which does the actual archiving. The resulting archives are then centralized by ArchiveTeam and uploaded to the Internet Archive.
> What exactly is archiveteam's contribution? I don't fully understand.
If the Internet Archive is a library, ArchiveTeam is the people who run around collecting stuff and give it to the library for safekeeping. Stuff that is estimated or announced to be disappearing or being removed soon tends to be the focus, too.
They gathered up the links for processing, because Google doesn't just hand out a list of the short links in use. So the links have to be gathered by brute force first.
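Conceptually, the brute-force part looks something like the sketch below. This is illustrative only: the alphabet, the 6-character ID length, and the tiny slice of the ID space are assumptions, and the real project shards ranges of IDs across many volunteers rather than walking them on one machine.

    import itertools
    import requests

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    # Walk a tiny slice of the candidate ID space; the full space is enormous.
    for combo in itertools.islice(itertools.product(ALPHABET, repeat=6), 1000):
        code = "".join(combo)
        resp = requests.head("https://goo.gl/" + code, allow_redirects=False, timeout=10)
        if resp.status_code in (301, 302):
            print(code, "->", resp.headers.get("Location"))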
Can we build a blockchain/P2P-based web crawler that can create snapshots of the entire web with high integrity (peer verification)? The already-crawled pages would be exchanged through bulk transfer between peers. This would mean there is an "official" source of all web data. LLM people could use snapshots of this. This would hopefully reduce the number of ill-behaved crawlers, so we would see less draconian anti-bot measures over time on websites, in turn making it easier to crawl. Does something like this exist? It would be so awesome. It would also allow people to run a search engine at home.
Common Crawl, while a massive dataset of the web, does not represent the entirety of the web.
It’s smaller than Google’s index and Google does not represent the entirety of the web either.
For LLM training purposes this may or may not matter, since it does have a large amount of the web. It's hard to prove scientifically whether the additional data would train a better model, because no one (AFAIK), not Google, not Common Crawl, not Facebook, not the Internet Archive, has a copy that holds the entirety of the currently accessible web (let alone dead links). I'm often surprised, using Google-fu, at how many pages I know exist, even from famous authors, that just don't appear in Google's index, Common Crawl, or IA.
Per google, shortened links “won't work after August 25 and we recommend transitioning to another URL shortener if you haven’t already.”
Am I missing something, or doesn’t this basically obviate the entire gesture of keeping some links active? If your shortened link is embedded in a document somewhere and can’t be updated, google is about to break it, no?
About to break it if it didn't seem 'actively used' in late 2024, yes. But if your document was being frequently read and the link actively clicked, it'll (now) keep working.
But as I said in a sibling comment to yours, I don't see the point of the distinction: why not just continue them all? Surely the mostly unused ones are even cheaper to serve.
I built a URL shortener years ago for fun. I don't have the resources that Google has, but I just hacked it together in Erlang using Riak KV, and it did horizontally scale across at least three computers (I didn't have more at the time).
Unless I'm just super smart (I'm not), it's pretty easy to write a URL shortener as a key-value system, and pure key-value stuff is pretty easy to scale. I cannot imagine Google isn't doing something as efficient as, or more efficient than, what I did.
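To show how small the core of a shortener really is, here is a minimal single-process sketch in Python (not the Erlang/Riak version described above; the in-memory dict stands in for whatever distributed key-value store you would use to scale out):

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    store = {}      # key: short code, value: long URL
    counter = 0     # a real system would use a distributed counter or random IDs

    def shorten(long_url):
        global counter
        counter += 1
        n, code = counter, ""
        while n:                          # base-62 encode the counter into a short code
            n, rem = divmod(n, 62)
            code = ALPHABET[rem] + code
        store[code] = long_url
        return code

    def resolve(code):
        return store.get(code)            # the redirect handler just looks this up and 301s

    print(resolve(shorten("https://example.com/some/very/long/path?with=params")))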
I don't understand the data on ArchiveTeam's page, but it seems like they have 286.56 TiB of data? It's a lot larger than I'd have thought.
This leaves me wondering what the point is. What could it possibly cost to keep redirecting the existing shortlinks that they already consider unused/low-activity anyway?
(In addition to the higher-activity ones the parent link says they'll now continue to redirect.)
In another submission someone speculated the reason might be the unending churn of the Google tech stack that just makes low-maintenance stuff impossible.
The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.
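For anyone who wants to poke at the dumps, here is a minimal sketch of pulling the short-to-long mappings back out of such a WARC with the warcio library (the filename is hypothetical, and warcio needs to be installed separately):

    from warcio.archiveiterator import ArchiveIterator

    with open("goo-gl-archive.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            short_url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            target = record.http_headers.get_header("Location")
            if target:  # goo.gl answers with a redirect, so the long URL is in Location
                print(short_url, status, target)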
I did some ridiculous napkin math. A random URL I pulled from a Google search was 705 bytes. A goo.gl link is 22 bytes, but if you only store the ID, it'd be 6 bytes. Some URLs are going to be shorter, some longer, but just ballparking it all, that lands us in the neighborhood of hundreds of billions of URLs, up to trillions of URLs.
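As a rough sanity check of that ballpark, using the two archive sizes quoted elsewhere in this thread (and keeping in mind that WARC records carry whole HTTP requests and responses, so the true link count is lower):

    per_entry = 705 + 6                    # bytes: one long URL plus a ~6-byte ID
    for tib in (91, 286.56):               # the two archive sizes mentioned in this thread
        entries = tib * 2**40 / per_entry
        print(tib, "TiB ->", f"{entries:.2e}", "entries")   # ~1.4e11 and ~4.4e11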
> A random URL I pulled from a Google search was 705 bytes.
705 bytes is an extremely long URL. Even if we assume that URLs that get shortened tend to be longer than URLs overall, that’s still an unrealistic average.
The 91 TiB includes not just the URL mappings but the actual content of all destination pages, which ArchiveTeam captures to ensure the links remain functional even if original destinations disappear.
OK, but the destination pages are not at risk (or at least no more than any random page on the web), so why spend any effort crawling them before all the short links have been saved?
3.75 billion URLs; according to this[1], the average URL is 76.97 characters, which would be ~268.8 GiB without the goo.gl IDs/metadata. So I also wonder what's up with that.
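That back-of-the-envelope number checks out:

    urls = 3.75e9                  # the 3.75 billion URLs mentioned above
    avg_len = 76.97                # average URL length in bytes, per the linked stat
    print(urls * avg_len / 2**30)  # ~268.8, i.e. ~268.8 GiB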
There used to be one such project (Pushshift), before the Reddit API change. You can download all the data and see all the info on the-eye, another datahoarder/preservationist group:
https://the-eye.eu/redarcs/
ArchiveTeam was doing that, but their stuff no longer works due to changes at Reddit. The wiki page about it links to some other groups doing Reddit archiving.
Enlisting in the Fight Against Link Rot - https://news.ycombinator.com/item?id=44877021 - Aug 2025 (107 comments)
Google shifts goo.gl policy: Inactive links deactivated, active links preserved - https://news.ycombinator.com/item?id=44759918 - Aug 2025 (190 comments)
Google's shortened goo.gl links will stop working next month - https://news.ycombinator.com/item?id=44683481 - July 2025 (222 comments)
Google URL Shortener links will no longer be available - https://news.ycombinator.com/item?id=40998549 - July 2024 (49 comments)
Ask HN: Google is sunsetting goo.gl on 3/30. What will be your URL shortener? - https://news.ycombinator.com/item?id=19385433 - March 2019 (14 comments)
Tell HN: Goo.gl (Google link Shortener) is shutting down - https://news.ycombinator.com/item?id=16902752 - April 2018 (45 comments)
Google is shutting down its goo.gl URL shortening service - https://news.ycombinator.com/item?id=16722817 - March 2018 (56 comments)
Transitioning Google URL Shortener to Firebase Dynamic Links - https://news.ycombinator.com/item?id=16719272 - March 2018 (53 comments)
That already exists; it's called CommonCrawl:
https://commoncrawl.org/
For digital preservation? We may discuss. For an LLM? Haha, no.
No, thank you.
They already have plenty of unused compute/older hardware/CDN POPs, a performant distributed data store, and everything else possibly needed.
It would be cheaper than the free credits they give away to just one startup to be on GCP.
I don't think infra costs are a factor in a decision like this.
The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?
https://web.archive.org/web/20250125064617/http://www.superm...
> twitter
Not that I know of, and you haven't even been able to archive tweets on the Wayback machine for YEARS.
https://wiki.archiveteam.org/index.php/Reddit
https://github.com/ArthurHeitmann/arctic_shift
How would that even function? I mean, did they loop through every single permutation and see the result, or how exactly would that work?