bartread · 5 years ago
I'm not sure I'm a fan of this because it just turns WayBackMachine into another content silo. It's called the world wide web for a reason, and this isn't helping.

I can see it for corporate sites where they change content, remove pages, and break links without a moment's consideration.

But for my personal site, for example, I'd much rather you link to me directly rather than to the content in WayBackMachine. Apart from anything else, linking to WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly, when I link to other content, I want to show its creators the same courtesy by linking directly to their content rather than to WayBackMachine.

What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine, or (perhaps better) generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.

I think it would probably need to treat redirects like broken links given the prevalence of corporate sites where content is simply removed and redirected to the homepage, or geo-locked and redirected to the homepage in other locales (I'm looking at you and your international warranty, and access to tutorials, Fender. Grr.).

I also probably wouldn't run it on every build because it would take a while, but once a week or once a month would probably do it.
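A rough sketch of the kind of thing I mean, assuming a Node 18+ build environment and a made-up urls.txt extracted from the site (so very much a sketch, not a finished tool):

    // check-links.mjs: flag broken or redirected links so they can be
    // swapped for Wayback Machine URLs by hand (sketch, Node 18+).
    import { readFile } from 'node:fs/promises';

    const urls = (await readFile('urls.txt', 'utf8')).split('\n').filter(Boolean);

    for (const url of urls) {
      try {
        // redirect: 'manual' so a 301 to a homepage is also treated as suspect
        const res = await fetch(url, { method: 'HEAD', redirect: 'manual' });
        if (res.status >= 300) {
          console.log(`SUSPECT ${res.status} ${url}`);
          console.log(`  candidate: https://web.archive.org/web/*/${url}`);
        }
      } catch {
        console.log(`DEAD ${url} (network error)`);
      }
    }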

silicon2401 · 5 years ago
> But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine.

That would make sense if users were archiving your site for your benefit, but they're probably not. If I were to archive your site, it's because I want my own bookmarks/backups/etc. to be more reliable than just a link, not because I'm looking to preserve your website. Otherwise, I'm just gambling that you won't one day change your content, design, etc. on a whim.

Hence I'm in a similar boat as the blog author. If there's a webpage I really like, I download and archive it myself. If it's not worth going through that process, I use the wayback machine. If it's not worth that, then I just keep a bookmark.

3pt14159 · 5 years ago
The issue is that if this becomes widespread then we're going to get into copyright claims against the wayback machine. When I write content it is mine. I don't even let Facebook crawlers index it because I don't want it appearing on their platform. I'm happy to have wayback machine archive it, but that's with the understanding that it is a backup, not an authoritative or primary source.

Ideally, links would be able to handle 404s and fall back, like we can do with images and srcset in HTML. That way, if my content goes away, we have a backup. I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify the content as it was at publication time via the wayback machine.
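You can approximate that today with a bit of script; the data-archive attribute here is made up, and a no-cors probe can only detect dead hosts, not 404s, so treat it as a sketch of the idea rather than a working fallback:

    <a href="https://example.com/post"
       data-archive="https://web.archive.org/web/2020/https://example.com/post">my post</a>
    <script>
    document.addEventListener('click', async (e) => {
      const a = e.target.closest('a[data-archive]');
      if (!a) return;
      e.preventDefault();
      try {
        // resolves for any reachable host, rejects on network failure
        await fetch(a.href, { method: 'HEAD', mode: 'no-cors' });
        location.href = a.href;
      } catch {
        location.href = a.dataset.archive; // original unreachable, use the backup
      }
    });
    </script>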

PaulHoule · 5 years ago
It's a deep problem with the web as we know it.

Say I want to make a "scrapbook" to support a research project of some kind. Really I want to make a "pyramid": a general overview that is at most a few pages at the top, then some documents that are more detailed, with the original reference material incorporated and linked to what it supports.

In 2020 much of that reference material will come from the web, and you are left either doing the "webby" thing (linking), which is doomed to fall victim to broken links, or archiving the content, which is OK for personal use but will not be OK with the content owners if you make it public. You could say the public web is also becoming a cesspool/crime scene, where even reputable websites are suspected of pervasive click fraud, and where the line between marketing and harassment gets harder to see every day.

ethagnawl · 5 years ago
> If it's not worth that, then I just keep a bookmark.

I've made a habit of saving every page I bookmark to the WayBackMachine. To my mind, this is the best of both worlds: you'll see any edits, additions, etc. to the source material, and if something you remember has been changed or gone missing, you have a static reference. I just wish there were a simple way to diff the two.

I keep meaning to write browser extensions to do both of these things on my behalf ...
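For what it's worth, the save-on-bookmark half looks like only a few lines as a WebExtension (untested sketch, needs the "bookmarks" permission; the diffing half is the hard part):

    // background.js: save every new bookmark to the Wayback Machine (sketch).
    browser.bookmarks.onCreated.addListener((id, bookmark) => {
      if (bookmark.url) {
        fetch(`https://web.archive.org/save/${bookmark.url}`);
      }
    });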

ogre_codes · 5 years ago
I can understand posting a link, plus an archival link just in case the original content is lost. But linking to an archival site only is IMO somewhat rude.

codethief · 5 years ago
> What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine

Addendum: that same tool should – at the time of creating your web site / blog post / … – ask WayBackMachine to capture those links in the first place. That would actually be a very neat feature, as it would guarantee that you could always roll the linked pages back to exactly the state they were in when you linked to them.
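A minimal sketch of that capture step, using the web.archive.org/save/ endpoint mentioned elsewhere in this thread (the URL list would really come from your build's link extraction):

    // archive-outlinks.mjs: capture outbound links at publish time
    // (sketch, Node 18+).
    const outlinks = [
      'https://example.com/some-reference',
      'https://example.org/another-one',
    ];

    for (const url of outlinks) {
      const res = await fetch(`https://web.archive.org/save/${url}`);
      console.log(`${res.ok ? 'archived' : 'failed'} ${url}`);
      await new Promise((r) => setTimeout(r, 5000)); // Save Page Now is rate-limited
    }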

thotsBgone · 5 years ago
I don't care enough to look into it, but I think Gwern has something like this set up on gwern.net.
ethagnawl · 5 years ago
Doesn't Wikipedia do something like this? If not, the WBM/Archive.org does something like it on Wikipedia's behalf.
abdullahkhalids · 5 years ago
Gwern.net has a pretty sophisticated system for this: https://www.gwern.net/Archiving-URLs
mcv · 5 years ago
It would be nice if there were an automatic way to have a link revert to the Wayback Machine once the original stops working. I can't think of an easy way to do that, though.
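Half of it exists, at least: the Wayback Machine has an availability API that returns the closest snapshot for a URL, so a 404 handler could do something like the sketch below. The hard part is reliably deciding that the original has stopped working, given soft 404s and redirects to homepages.

    // Find the closest archived snapshot for a dead link (sketch).
    async function waybackFallback(url) {
      const api = 'https://archive.org/wayback/available?url=' + encodeURIComponent(url);
      const data = await (await fetch(api)).json();
      const snap = data.archived_snapshots?.closest;
      return snap?.available ? snap.url : null; // null if never archived
    }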
jazzyjackson · 5 years ago
Brave browser has this built in: if you end up at a dead link, the address bar offers to take you to the Wayback Machine.

http://blog.archive.org/2020/02/25/brave-browser-and-the-way...

boogies · 5 years ago
I just use a bookmarklet

    javascript:void(window.open('https://web.archive.org/web/*/'+location.href.replace(/\/$/, '')));
(which is only slightly less convenient than what others have already pointed out — the FF extension and Brave built-in feature).

riffraff · 5 years ago
Wikipedia just does "$some-link-here (Archived $archived-version-link)", and it works pretty well, imo.
MaxBarraclough · 5 years ago
Either a browser extension, or an 'active' system where your site checks the health of the pages it links to.
DavideNL · 5 years ago
Their browser extension does exactly that...
jrochkind1 · 5 years ago
The International Internet Preservation Consortium is attempting a technological solution that gives you the best of both worlds in a flexible way, and is meant to be extended to support multiple archival preservation content providers.

https://robustlinks.mementoweb.org/about/

(Although nothing else like the IA Wayback Machine exists presently, and I'm not sure what would make someone else try to 'compete' when IA is doing so well. That is a problem, but refusing to use the IA doesn't solve it!)

akavel · 5 years ago
Or: snapshot a WARC archive of the site locally, then start serving it only in case the original goes down. For extra street cred, seed it to IPFS. (A.k.a. one of too many projects on my To Build One Day list.)
nikisweeting · 5 years ago
ArchiveBox is built for exactly this use-case :)

https://github.com/pirate/ArchiveBox

NateEag · 5 years ago
I use linkchecker for this on my personal sites:

https://linkchecker.github.io/linkchecker/

There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories, making it particularly easy to use with static websites before deploying them.

https://www.npmjs.com/package/broken-link-checker-local

privong · 5 years ago
> There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories

linkchecker can do this as well, if you provide it a directory path instead of a url.

polygot · 5 years ago
I made a browser extension which replaces links in articles and stackoverflow answers with archive.org links on the date of their publication (and date of answers for stackoverflow questions): https://github.com/alexyorke/archiveorg_link_restorer
DeusExMachina · 5 years ago
> generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.

SEO tools like Ahrefs do this already, although the price might be a bit too steep if you only want that functionality. There are probably cheaper alternatives as well.

deepstack · 5 years ago
Yeah, at some point the Wayback Machine needs to be on a WebTorrent/IPFS type of thing, where it is immutable.
scruffyherder · 5 years ago
I was surprised when digital.com got purged.

Then further dismayed that the utzoo Usenet archives were purged.

Archive sites are still subject to being censored and deleted.

alfonsodev · 5 years ago
Is there any active project pursuing this idea?
FinnLeSueur · 5 years ago
> generate a report of broken links

I actually made a little script that does just this. It’s pretty dinky but works a charm on a couple of sites I run.

https://github.com/finnito/link-checker

zwayhowder · 5 years ago
Also, don't forget that while I might go to an article written ten years ago, the Wayback archive won't show me a related article that you published two years ago updating the information or correcting a mistake.
mark-r · 5 years ago
And when you die, who will be maintaining your personal site? What happens when the domain gets bought by a link scammer?

Maybe your pages should each contain a link to the original, so it's just a single click if someone wants to get to your original site from the wayback backup.

canofbars · 5 years ago
The Wayback Machine converts all links on a page to Wayback links, so you can navigate a dead site normally.
scruffyherder · 5 years ago
I spent hours getting all the stupid redirects working from different hosts, domains and platforms.

People still use RSS either to steal my stuff or to discuss it off-site (as if commenting to the author is so scary!), often in a way that leaves me totally unaware it's happening. So many times people ask questions of the author on a site like this, or bring up good points or something to go further on, that I would otherwise miss.

It’s a shame pingbacks were hijacked, but the siloing sucks too.

Sometimes I forget for months at a time to check other sites; not every post generates 5000+ hits in an hour.

1vuio0pswjnm7 · 5 years ago
What if your personal site is, like so many others these days, on shared IP hosting like Cloudflare, AWS, Fastly, Azure, etc.?

In the case of Cloudflare, for example, we as users are not reaching the target site; we are just accessing a CDN. The nice thing about archive.org is that it does not require SNI. (Cloudflare's TLS 1.3 and ESNI work quite well AFAICT, but they are the only CDN that has it working.)

I think there should be more archive.org's. We need more CDNs for users as opposed to CDNs for website owners.

bad_user · 5 years ago
The "target site" is the URL from the author's domain, and Cloudflare is the domain's designated CDN. The user is reaching the server that the webmaster wants reachable.

That's how the web works.

> The nice thing about archive.org is that it does not require SNI

I fail to see how that's even a thing to consider.

markjgraham · 5 years ago
We suggest/encourage people to link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs, so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift).

BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent.

Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.

jhallenworld · 5 years ago
It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF, a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate wayback machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override.

Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved.
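To make the first idea concrete, the markup might look something like this (purely hypothetical syntax; no browser supports it):

    <!-- Hypothetical: the browser tries the AREF if the HREF fails -->
    <a href="https://example.com/post"
       aref="https://web.archive.org/web/2020/https://example.com/post">
      my post
    </a>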

Well, as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it:

https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...

javajosh · 5 years ago
So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF, which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice (including, it must be said, the Wayback machine itself).
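A sketch of the verification half in browser JS (ignoring CORS; the expectedHex argument stands in for the hypothetical HASHREF value):

    // Check whether linked content still matches the hash recorded at link time.
    async function verifyLink(url, expectedHex) {
      const buf = await (await fetch(url)).arrayBuffer();
      const digest = await crypto.subtle.digest('SHA-256', buf);
      const hex = [...new Uint8Array(digest)]
        .map((b) => b.toString(16).padStart(2, '0'))
        .join('');
      return hex === expectedHex; // false: content drifted, try an archive
    }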
shortformblog · 5 years ago
This is literally where my brain was going and I was glad to see someone went in the same direction. Given the <img> tag’s addition of srcset in recent years, there is precedent for doing something more with href.
devenblake · 5 years ago
Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad; most of the time backing out of the message once or twice does the trick. But it's funny that I most often get that message when going to the IA web uploader.
Arkanosis · 5 years ago
This is so much better than INSTEAD.

Not only because it leaves some control with the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow get the diff between them. IMHO, this is especially relevant when the purpose is reference.

tracker1 · 5 years ago
I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag.

Link rot isn't the only reason why one would want an archive link instead of the original. Not that I'd want to overwhelm the Internet Archive's resources.

punnerud · 5 years ago
I love the feature that you can easily add a page to the archive: https://web.archive.org/save/https://example.com

Replace https://example.com in the URL above. I try to respect the cost of archiving by not saving the same page too often.

tdrp · 5 years ago
Thanks so much for running this site - as a small start-up we often manually request a snapshot of our privacy policy/terms of service/other important announcements whenever we make changes to them (if we don't manually request them, the re-crawl generally doesn't happen, since I guess those pages are very rarely visited, even though they're linked from the main site). It's helped us in a thorny situation where someone tried to claim "it wasn't there when I signed up".

It might be an interesting use-case for you to check out, i.e. keeping an eye on those rarely visited legal pages for smaller companies.

Ziggy_Zaggy · 5 years ago
Kudos for doing what you do.
arendtio · 5 years ago
I always wonder about the rise in hosting costs in the wake of people linking to the Wayback Machine from popular sites.

How do you think about it?

bherb · 5 years ago
shemnon42 · 5 years ago
Came here for this. Have my upvote.
outsomnia · 5 years ago
This is a bad idea...

In the worst case one might write a cool article and get two hits: one from someone noticing it exists, and the other from the archive service. After that it might go viral, but the author may have given up by then.

The author is losing out on inbound links, so Google thinks their site is irrelevant and gives it a bad PageRank.

All you need to do is get archive.org to take a copy at the time; you can always adjust your link to point to that if the original is dead.

bryanrasmussen · 5 years ago
There's no reason PageRank couldn't be adapted to take Wayback Machine URLs into account: if there's a link with a URL pointing at https://web.archive.org/web/*/https://news.ycombinator.com/, Google could easily register that as a link to both resources, one to web.archive.org and the other to the site.
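Mechanically, registering it as a link to both would be trivial, since the original URL is embedded in the Wayback URL itself, e.g.:

    // Recover the original URL from a Wayback Machine URL (sketch), so a
    // crawler could credit both targets.
    function originalFromWayback(waybackUrl) {
      const m = waybackUrl.match(/^https?:\/\/web\.archive\.org\/web\/[^/]+\/(.+)$/);
      return m ? m[1] : null;
    }
    // originalFromWayback('https://web.archive.org/web/*/https://news.ycombinator.com/')
    //   returns 'https://news.ycombinator.com/'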

There is also no reason why that has to become a slippery slope, if anyone is going to ask "but where do you stop!!"

dmitriid · 5 years ago
After all, they did change their search to accommodate AMP. Changing it to take WebArchive into account is a) peanuts and b) actually better for the web.
ethanwillis · 5 years ago
Google shouldn't be the center of the Web. They could also easily determine where the archive link is pointing to and not penalize. But I guess making sure we align with Google's incentives is more important than just using the Web.
bartread · 5 years ago
> Google shouldn't be the center of the Web.

I agree, but are you suggesting it's going to be better if WayBackMachine is?

luckylion · 5 years ago
> But I guess making sure we align with Google's incentives is more important than just using the Web.

It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so.

Build an alternative. I'm sure nobody wants Google to be the number one way of finding content; it's just that they are, so pretending they're not, and doing something that will hurt your ability to have your content found, isn't productive.

rchaud · 5 years ago
Every search engine uses the number of backlinks as one of the key factors in influencing search rank; it's a fundamental KPI when it comes to understanding whether a link is credible.

What is true for Google in this regard is also true of Bing, DDG and Yandex.

marcus_holmes · 5 years ago
I totally agree.

I guess the answer is "don't mess with your old site", but that's also impractical.

And I'm sorry, but if it's my site, then it's my site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like.

I'm sorry if that conflicts with someone else's need for everything to stay the same but it's my site.

Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's my article. Your right to not have a dead link doesn't supersede my right to withdraw a previous publication, surely?

pingpongchef · 5 years ago
You can go down this road, but it looks like you're advocating for each party to simply do whatever they want, in which case the viewing party will continue to value archiving.
mitchdoogle · 5 years ago
I certainly don't know about legal rights, but I think the ethical thing is to make sure that any writings published as freely accessible should remain so forever. What would people think if an author went into every library in the world to yank out one of their books they no longer want to be seen?

I do think the author is wrong to immediately post links to archived versions of sources. At the least, he could link to both the original and the archived version.

johannes1234321 · 5 years ago
One can also do it the way Wikipedia's references sections do, linking to both the original and the memento in the archive (once the bot notices the original is gone).

Additional benefit: some edits are good (addendums, typo corrections, etc.).

acatton · 5 years ago
archive.org sends the HTTP header

  Link: <https://example.com>; rel="original"
This can be used by search engines to adjust their ranking algorithms.

scruffyherder · 5 years ago
Even worse is when you have people using RSS to wholesale copy your site and its updates; again, the traffic and, more importantly, the engagement disappear.

It’s very demotivating.

CaptArmchair · 5 years ago
So, this is the problem of the persistence of URLs: always referencing the original content, regardless of where it is hosted, in an authoritative way.

It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may.

Though, such promises are just that: promises. Over a long period of time, no one can truly guarantee the persistence of the relationship between a URI and the resource it references. That's not something technology itself solves.

The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document.

Besides, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?

Part of ensuring persistence is the responsibility of original publisher. That's where solutions such as URL resolving come into play. In the academic world, DOI or handle.net are trying to solve this problem. Protocols such as ORE or Memento further try to cater to this issue. It's a rabbit hole, really, when you start to think about this.

kapep · 5 years ago
> Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?

WB also supports linking to the very latest version. If the archive is updated frequently enough, I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page was moved or removed, though that is probably just a technical issue caused by some website misconfiguration or bad error handling.

im3w1l · 5 years ago
Signed HTTP Exchanges could be a neat solution here.
ffpip · 5 years ago
You can create a bookmark in Firefox to save a link quickly.

Bookmark Location: https://web.archive.org/save/%s

Keyword: save

So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post.

You can use any Keyword instead of 'save'.

You can also search with https://web.archive.org/*/%s

bad_user · 5 years ago
Does that `save` keyword work?

The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid:

https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator....

aendruk · 5 years ago
Uppercase %S for unescaped, e.g.:

https://web.archive.org/web/*/%S

ffpip · 5 years ago
web.archive.org automatically converts the https%3A%2F things to https:// for me. I've noticed it many times.

If you are still facing problems, go to https://web.archive.org . In the bottom right 'Save page now' field, right click and select 'add keyword for search'. Choose your desired keyword.

kilroy123 · 5 years ago
Nice. I forgot how you can do that.

I just use the extension myself:

https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...

ffpip · 5 years ago
Yeah. That requires access to all sites. I wasn't comfortable adding another addon with that permission.

The permission is there for a simple reason, and the feature could be off by default: it lets you right-click a link on any page and select 'archive' from the menu. Small function, but it requires access to all sites.

badsectoracula · 5 years ago
One issue I have with this extension is that it randomly pops up the 'this site appears to be offline' message (which overrides the entire page) even when the site actually works (I hit the back button and the page appears). I have had it installed for some time now, and so far I get almost daily false positives; only once has it actually worked as intended.

Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway, since I very often want to find old, long-lost sites.

kibibu · 5 years ago
Can we update this link to point to the archive version?
drummer · 5 years ago
Brilliant
imhoguy · 5 years ago
This is building yet another silo and point of failure. We can't pass the entire Internet's traffic through the WayBackMachine, as its resources are limited.

Most preservation solutions are like that, and in the end funding or business priorities (Google Groups) become a serious problem.

I think we need something like the web itself: distributed, and dead simple to participate in and contribute preservation space to.

Look, there are torrents that have been available for 17 years [0]. Sure, some uninteresting ones are long gone, but there is always a little chance somebody still has the file and someday comes online with it.

I know about IPFS/Dat/SSB, but still, that stuff, like Bitcoin, is too complex for a layman contributor with a plain altruistic motivation. It should be like SETI@home: fire and forget. Eventually integrated with a browser, to cache content you star/bookmark and share it when the original is offline.

[0] https://torrentfreak.com/worlds-oldest-torrent-still-alive-a...