I'm not sure I'm a fan of this because it just turns WayBackMachine into another content silo. It's called the world wide web for a reason, and this isn't helping.
I can see it for corporate sites where they change content, remove pages, and break links without a moment's consideration.
But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine. Apart from anything else linking to WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly, when I link to other content, I want to show its creators the same courtesy by linking directly to their content rather than WayBackMachine.
What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine, or (perhaps better) generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.
I think it would probably need to treat redirects like broken links given the prevalence of corporate sites where content is simply removed and redirected to the homepage, or geo-locked and redirected to the homepage in other locales (I'm looking at you and your international warranty, and access to tutorials, Fender. Grr.).
I also probably wouldn't run it on every build because it would take a while, but once a week or once a month would probably do it.
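For what it's worth, here is a rough sketch of what such a build task might look like in TypeScript (not a real tool: it assumes Node 18+ for the global fetch, the link list is a placeholder, and the Wayback lookup uses the public availability API at archive.org/wayback/available):

    // Rough sketch: check each outbound link, treat redirects as broken (per the
    // homepage-redirect problem above), and look up a Wayback Machine fallback.
    type LinkReport = { url: string; status: number | "error"; waybackUrl?: string };

    async function findWayback(url: string): Promise<string | undefined> {
      const api = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
      const data = await (await fetch(api)).json();
      return data?.archived_snapshots?.closest?.url;
    }

    async function checkLink(url: string): Promise<LinkReport> {
      try {
        // redirect: "manual" so a 301/302 is reported rather than silently followed.
        // Note: some servers reject HEAD, so a GET fallback may be needed in practice.
        const res = await fetch(url, { method: "HEAD", redirect: "manual" });
        if (res.status >= 200 && res.status < 300) return { url, status: res.status };
        return { url, status: res.status, waybackUrl: await findWayback(url) };
      } catch {
        return { url, status: "error", waybackUrl: await findWayback(url) };
      }
    }

    // Run weekly or monthly over the links collected from the built site.
    const links = ["https://example.com/some-page"]; // placeholder
    Promise.all(links.map(checkLink)).then((report) => {
      for (const r of report) {
        if (r.status === "error" || r.status < 200 || r.status >= 300) console.log(r);
      }
    });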
> But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine.
That would make sense if users were archiving your site for your benefit, but they're probably not. If I were to archive your site, it's because I want my own bookmarks/backups/etc to be more reliable than just a link, not because I'm looking out to preserve your website. Otherwise, I'm just gambling that you won't one day change your content, design, etc on a whim.
Hence I'm in a similar boat as the blog author. If there's a webpage I really like, I download and archive it myself. If it's not worth going through that process, I use the wayback machine. If it's not worth that, then I just keep a bookmark.
The issue is that if this becomes widespread then we're going to get into copyright claims against the wayback machine. When I write content it is mine. I don't even let Facebook crawlers index it because I don't want it appearing on their platform. I'm happy to have wayback machine archive it, but that's with the understanding that it is a backup, not an authoritative or primary source.
Ideally, links would be able to handle 404s and fall back, like we can do with images and srcset in HTML. That way, if my content goes away, we have a backup. I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify the content as it was at the time of publication via the Wayback Machine.
Say I want to make a "scrapbook" to support a research project of some kind. Really I want to make a "pyramid": a general overview that is at most a few pages at the top, then some documents that are more detailed, with the original reference material incorporated and linked to what it supports.
In 2020, much of that reference material will come from the web, and you are left either doing the "webby" thing (linking), which is doomed to fall victim to broken links, or archiving the content, which is OK for personal use but will not be OK with the content owners if you make it public. You could say the public web is also becoming a cesspool/crime scene, where even reputable web sites are suspected of pervasive click fraud and the line between marketing and harassment gets harder to see every day.
> If it's not worth that, then I just keep a bookmark.
I've made a habit of saving every page I bookmark to the WayBackMachine. To my mind, this is the best of both worlds: you'll see any edits, additions, etc. to the source material, and if something you remember has been changed or gone missing, you have a static reference. I just wish there was a simple way to diff the two.
I keep meaning to write browser extensions to do both of these things on my behalf ...
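For the first of those (saving every new bookmark to the WayBackMachine), a minimal WebExtension background-script sketch might look like this, assuming the "bookmarks" permission, a host permission for web.archive.org, and the Firefox-style browser namespace (webextension-polyfill typings if you want TypeScript to be happy):

    // Sketch only: request a Wayback Machine capture for every newly created bookmark.
    browser.bookmarks.onCreated.addListener(async (_id, bookmark) => {
      if (bookmark.url) {
        // Same public save endpoint mentioned elsewhere in this thread.
        await fetch(`https://web.archive.org/save/${bookmark.url}`);
      }
    });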
I can understand posting a link, plus an archival link just in case the original content is lost. But linking to an archival site only is IMO somewhat rude.
> What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine
Addendum: First, that same tool should – at the time of creating your web site / blog post / … – ask WayBackMachine to capture those links in the first place. That would actually be a very neat feature, as it would guarantee that you could always roll back the linked websites to exactly the time you linked to them on your page.
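A sketch of that addendum, assuming the public https://web.archive.org/save/ endpoint (the same pattern used for the bookmark-keyword trick later in the thread) and spacing requests out since it is rate-limited:

    // Sketch: at publish time, ask the Wayback Machine to capture each outbound
    // link so a snapshot exists from roughly the moment it was linked.
    async function saveAll(urls: string[], delayMs = 5000): Promise<void> {
      for (const url of urls) {
        const res = await fetch(`https://web.archive.org/save/${url}`);
        console.log(`${url} -> ${res.ok ? "capture requested" : `failed (${res.status})`}`);
        await new Promise((resolve) => setTimeout(resolve, delayMs)); // be polite to archive.org
      }
    }

    // saveAll(["https://example.com/some-reference"]);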
Would be nice if there's an automatic way to have a link revert to the Wayback Machine once the original link stops working. I can't think of an easy way to do that, though.
The International Internet Preservation Consortium is attempting a technological solution that gives you the best of both worlds in a flexible way, and is meant to be extended to support multiple archival preservation content providers.
http://blog.archive.org/2020/02/25/brave-browser-and-the-way...
https://robustlinks.mementoweb.org/about/
(although nothing else like the IA Wayback machine exists presently, and I'm not sure what would make someone else try to 'compete' when IA is doing so well, which is a problem, but refusing to use the IA doesn't solve it!)
Or: snapshot a WARC archive of the site locally, then start serving it only in case the original goes down. For extra street cred, seed it to IPFS. (A.k.a. one of too many projects on my To Build One Day list.)
https://github.com/pirate/ArchiveBox
https://linkchecker.github.io/linkchecker/
There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories, making it particularly easy to use with static websites before deploying them.
https://www.npmjs.com/package/broken-link-checker-local
linkchecker can do this as well, if you provide it a directory path instead of a url.
I made a browser extension which replaces links in articles and stackoverflow answers with archive.org links on the date of their publication (and date of answers for stackoverflow questions): https://github.com/alexyorke/archiveorg_link_restorer
> generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.
SEO tools like Ahrefs do this already. Although the price might be a bit too steep if you only want that functionality. But there are probably cheaper alternatives as well.
Archive sites are still subject to being censored and deleted. I was further dismayed that the utzoo Usenet archives were purged.
I actually made a little script that does just this. It's pretty dinky but works a charm on a couple of sites I run.
https://github.com/finnito/link-checker
Not to forget that while I might go to an article written ten years ago, the Wayback archive won't show me a related article that you published two years ago updating the article information or correcting a mistake.
And when you die, who will be maintaining your personal site? What happens when the domain gets bought by a link scammer?
Maybe your pages should each contain a link to the original, so it's just a single click if someone wants to get to your original site from the wayback backup.
I spent hours getting all the stupid redirects working from different hosts, domains and platforms.
People still use RSS to either steal my stuff or discuss it off-site (as if commenting to the author is so scary!), often in a way that leaves me totally unaware it's happening; so many times people ask questions of the author on a site like this, or bring up good points worth going further on, that I would otherwise miss.
It's a shame pingbacks were hijacked, but the siloing sucks too.
Sometimes I forget for months at a time to check other sites; not every post generates 5000+ hits in an hour.
What if your personal site is, like so many others these days, on shared-IP hosting like Cloudflare, AWS, Fastly, Azure, etc.?
In the case of Cloudflare, for example, we as users are not reaching the target site, we are just accessing a CDN. The nice thing about archive.org is that it does not require SNI. (Cloudflare's TLS 1.3 and ESNI work quite well AFAICT, but they are the only CDN that has it working.)
I think there should be more archive.org's. We need more CDNs for users as opposed to CDNs for website owners.
The "target site" is the URL from the author's domain, and Cloudflare is the domain's designated CDN. The user is reaching the server that the webmaster wants reachable.
That's how the web works.
> The nice thing about archive.org is that it does not require SNI
I fail to see how that's even a thing to consider.
We suggest/encourage people link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs, so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift).
BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent.
Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.
It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF, a list of archive links. The browser could automatically try a backup if the main one fails, and the user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate Wayback Machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override.
Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved.
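Browsers obviously support nothing like AREF today, but the rewrite-rule variant can at least be sketched in userland: derive an archive URL from each outbound href and append a small "(archived)" link next to it. This uses the Wayback URL pattern that appears elsewhere in the thread; the rule and function names are made up.

    // Userland sketch of the rewrite-rule idea.
    type RewriteRule = (href: string) => string;

    // Default rule: the Wayback Machine's snapshot-calendar URL pattern.
    const waybackRule: RewriteRule = (href) => `https://web.archive.org/web/*/${href}`;

    function addArchiveLinks(rule: RewriteRule = waybackRule): void {
      document.querySelectorAll<HTMLAnchorElement>('a[href^="http"]').forEach((a) => {
        const alt = document.createElement("a");
        alt.href = rule(a.href);
        alt.textContent = " (archived)";
        a.insertAdjacentElement("afterend", alt);
      });
    }

    addArchiveLinks();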
Well as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it:
https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...
So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice. (Including, it must be said, the wayback machine itself).
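A sketch of how such a HASHREF could be produced at link-creation time (Node 18+; the data-hashref attribute name is hypothetical):

    // Sketch: fetch the resource once when the link is written and record a
    // SHA-256 of its body next to the href, so the content can later be located
    // or verified in a content-addressable system.
    import { createHash } from "node:crypto";

    async function withHashref(url: string): Promise<string> {
      const body = Buffer.from(await (await fetch(url)).arrayBuffer());
      const digest = createHash("sha256").update(body).digest("hex");
      return `<a href="${url}" data-hashref="sha256-${digest}">${url}</a>`;
    }

    // withHashref("https://example.com/page").then(console.log);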
This is literally where my brain was going and I was glad to see someone went in the same direction. Given the <img> tag’s addition of srcset in recent years, there is precedent for doing something more with href.
Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad, most of the time backing out of the message once or twice does the trick, but it's funny because most of the time I get that message when going to the IA web uploader.
Not for the sole reason that it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow have the diff between them. IMHO, this is especially relevant when the purpose is reference.
I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag.
Link rot isn't the only reason why one would want an archive link instead of original. Not that I'd want to overwhelm the internet archive's resources.
Thanks so much for running this site - as a small start-up we often manually request a snapshot of our privacy policy/terms of service/other important announcements whenever we make changes to them (if we don't manually request them the re-crawl generally doesn't happen since I guess those pages are very rarely visited, even though they're linked from the main site). It's helped us in a thorny situation where someone tried to claim "it wasn't there when I signed up".
Replace https://example.com in https://web.archive.org/save/https://example.com with the page you want saved. I try to respect the cost of archiving by not saving the same page too often.
It might be an interesting use-case for you to check out, i.e. keep an eye on those rarely used legal sublinks for smaller companies. How do you think about it?
In the worst case one might write a cool article and get two hits: one from someone noticing it exists, and the other from the archive service. After that it might go viral, but the author may have given up by then.
The author is losing out on inbound links, so Google thinks their site is irrelevant and gives it a bad PageRank.
All you need to do is get archive.org to take a copy at the time, you can always adjust your link to point to that if the original is dead.
There's no reason that PageRank couldn't be adapted to take Wayback Machine URLs into account: if there is a link with a URL pointing at https://web.archive.org/web/*/https://news.ycombinator.com/, Google could easily register that as a link to both resources - one to web.archive.org, the other to the site (extracting the original is trivial; see the sketch below).
There is also no reason why that has to become a slippery slope, if anyone is going to ask "but where do you stop?!"
After all, they did change their search to accommodate AMP. Changing it to take WebArchive into account is a) peanuts and b) actually better for the web.
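A Wayback Machine URL already embeds the original URL, so crediting both resources is mostly string parsing. For example:

    // Extract the original URL from a Wayback Machine URL of the form
    // https://web.archive.org/web/<timestamp or *>/<original-url>
    function originalFromWayback(waybackUrl: string): string | null {
      const m = waybackUrl.match(/^https?:\/\/web\.archive\.org\/web\/[^/]+\/(https?:\/\/.+)$/);
      return m ? m[1] : null;
    }

    console.log(originalFromWayback("https://web.archive.org/web/*/https://news.ycombinator.com/"));
    // -> https://news.ycombinator.com/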
Google shouldn't be the center of the Web. They could also easily determine where the archive link is pointing to and not penalize. But I guess making sure we align with Google's incentives is more important than just using the Web.
I agree, but are you suggesting it's going to be better if WayBackMachine is?
> But I guess making sure we align with Google's incentives is more important than just using the Web.
It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so.
Build an alternative. I'm sure nobody wants Google to be the number one way of finding content; it's just that they are, so pretending they're not and doing something that will hurt your ability to have your content found isn't productive.
Every search engine uses the number of backlinks as one of the key factors in influencing search rank; it's a fundamental KPI when it comes to understanding whether a link is credible.
What is true for Google in this regard is also true of Bing, DDG and Yandex.
I guess the answer is "don't mess with your old site", but that's also impractical.
And I'm sorry, but if it's my site, then it's my site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like.
I'm sorry if that conflicts with someone else's need for everything to stay the same but it's my site.
Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's my article. Your right to not have a dead link doesn't supersede my right to withdraw a previous publication, surely?
You can go down this road, but it looks like you're advocating for each party to simply do whatever he wants. In which case the viewing party will continue to value archiving.
I certainly don't know about legal rights, but I think the ethical thing is to make sure that any writings published as freely accessible should remain so forever. What would people think if an author went into every library in the world to yank out one of their books they no longer want to be seen?
I do think the author is wrong to immediately post links to archived versions of sources. At the least, he could link to both the original and archived.
One can also do it the way Wikipedia reference sections do, linking to both the original and the memento in the archive (once the bot notices it's gone).
Additional benefit: some edits are good (addenda, typo corrections, etc.).
Even worse is when you have people using RSS to wholesale copy your site and its updates, and again that traffic, and more importantly the engagement, disappears.
It's very demotivating.
So, this is the problem of persistence: URLs always referencing the original content, regardless of where it is hosted, in an authoritative way.
It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may.
Though, such promises are just that: promises. Over a long period of time, no one can truly guarantee the persistence of a relationship between a URI and the resource it references. That's not something technology itself solves.
The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document.
Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?
Part of ensuring persistence is the responsibility of original publisher. That's where solutions such as URL resolving come into play. In the academic world, DOI or handle.net are trying to solve this problem. Protocols such as ORE or Memento further try to cater to this issue. It's a rabbit hole, really, when you start to think about this.
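To make the Memento part concrete: the Wayback Machine acts as a Memento TimeGate, so a client can ask for a URI "as of" a given date via an Accept-Datetime header and follow the redirect to the closest snapshot. This is a hedged sketch; the endpoint form is assumed from the Memento spec and the Wayback Machine's documented support.

    // Ask the Wayback Machine's Memento TimeGate for the snapshot closest to a
    // given date and return the memento URL it redirects to.
    async function closestMemento(url: string, when: Date): Promise<string> {
      const res = await fetch(`https://web.archive.org/web/${url}`, {
        headers: { "Accept-Datetime": when.toUTCString() },
        redirect: "follow",
      });
      return res.url; // e.g. https://web.archive.org/web/20200301000000/https://example.com/
    }

    // closestMemento("https://example.com/", new Date("2020-03-01")).then(console.log);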
> Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?
WB also supports linking to the very latest version. If the archive is updated frequently enough, I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page has been moved or removed, but that is probably just a technical issue caused by some website misconfiguration or bad error handling.
Bookmark Location- https://web.archive.org/save/%s
Keyword - save
So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post.
You can use any Keyword instead of 'save'.
You can also search with https://web.archive.org/*/%s
The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid:
https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator....
https://web.archive.org/web/*/%S
web.archive.org automatically converts the https%3A%2F things to https:// for me. I noticed it many times.
If you are still facing problems, go to https://web.archive.org . In the bottom right 'Save page now' field, right click and select 'add keyword for search'. Choose your desired keyword.
I just use the extension myself:
https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...
Yeah. That requires access to all sites. I wasn't comfortable adding another addon with that permission.
The permission is just for a simple reason and should be off by default. It is so you can right click a link on any page and select 'archive' from the menu. Small function, but requires access to all sites.
One issue I have with this extension is that it randomly pops up the 'this site appears to be offline' message (which overrides the entire page) even when the site actually works (I hit the back button and it appears). I've had it installed for some time now, and so far I get almost daily false positives; only once has it actually worked as intended.
Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway since I very often want to find old, long-lost sites.
This is building yet another silo and point of failure. We can't pass the entire Internet's traffic through WayBackMachine; its resources are limited.
Most preservation solutions are like that, and in the end the funding or business priorities (Google Groups) become a serious problem.
I think we need something like the web itself - distributed, and dead easy to participate in and contribute preservation space to.
Look, there are torrents that have been available for 17 years [0]. Sure, some are uninteresting and long gone, but there is always a little chance somebody still has the file and someday comes online with it.
I know about IPFS/Dat/SSB, but that stuff, like Bitcoin, is still too complex for a layman contributor with a plain altruistic motivation.
It should be like SETI@Home - fire and forget, eventually integrated with a browser to cache content you star/bookmark and share it when it goes offline.
[0] https://torrentfreak.com/worlds-oldest-torrent-still-alive-a...