In his reflections, Brewster Kahle mentions his goal of creating “a library available to anybody, anywhere in the world.” He doesn’t mention, though, the costs of making that library available to the world for free or the fact that the Internet Archive accepts donations. So I will:
Donations are fantastic, but if you have engineering (or project management, design, etc.) skills, spending just an hour a week contributing to their open source goes a very long way!
It's nice that they're there. At the same time, it's amazingly easy for content to be removed - if someone objects, or even if things are merely murky.
For example, all content from the old ezboard site was removed based on the current URL owner's robots.txt - and the current URL owner is just a domain parker. Ezboard hosted a lot of content back in the day.
https://archive.org/post/560730/ezboard-is-there-any-hope
This is an old problem; I could have sworn there were promises they were going to change their procedures.
The question I have is how fast the content is removed after the domain name registration changes. I.e., is there a window of time between the appearance of a new robots.txt and the next scheduled crawl, and if so, would it be possible to "rescue" the content during that window, as ArchiveTeam would, before it disappears?
If this is possible, there could be a service for monitoring changes to domain name registrations for sites that have large amounts of historical content. I would happily volunteer to set up such a service.
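To make the idea concrete, here is a minimal sketch of what such a monitor could look like - assuming Node 18+ for the built-in fetch, a hand-maintained watch list (the domain below is a placeholder), and the Wayback Machine's public "Save Page Now" page as one possible rescue path. None of this is an official Internet Archive workflow.

```ts
// monitor-robots.ts - watch robots.txt on domains that host large amounts of
// historical content, and raise a flag when a new owner starts blocking crawlers.
// Sketch only: the watch list, polling interval, and "rescue" step are placeholders.

const WATCHED_DOMAINS = ["example-legacy-forum.com"]; // hypothetical watch list

const lastSeen = new Map<string, string>(); // domain -> last robots.txt body seen

async function fetchRobots(domain: string): Promise<string> {
  try {
    const res = await fetch(`https://${domain}/robots.txt`);
    return res.ok ? await res.text() : "";
  } catch {
    return ""; // unreachable domains are treated as having no robots.txt
  }
}

function blocksArchiving(robots: string): boolean {
  // Crude check for a blanket Disallow aimed at everyone or at ia_archiver,
  // the user-agent the Wayback Machine has historically honored for exclusions.
  return /user-agent:\s*(\*|ia_archiver)\s*[\r\n]+\s*disallow:\s*\/\s*$/im.test(robots);
}

async function checkOnce(): Promise<void> {
  for (const domain of WATCHED_DOMAINS) {
    const current = await fetchRobots(domain);
    const previous = lastSeen.get(domain);
    if (previous !== undefined && previous !== current && blocksArchiving(current)) {
      console.warn(`${domain}: robots.txt changed and now blocks archiving - rescue window may be closing`);
      // Rescue step goes here: alert ArchiveTeam, or push key URLs through the
      // Wayback Machine's "Save Page Now" form at https://web.archive.org/save/<url>.
    }
    lastSeen.set(domain, current);
  }
}

void checkOnce();                             // run once at startup...
setInterval(checkOnce, 24 * 60 * 60 * 1000);  // ...then once a day
```

Polling robots.txt directly sidesteps the need for WHOIS access: a registration change only matters here once it shows up as a changed robots.txt anyway.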
Hopefully it is just hidden and not deleted. But this is the main reason alternative archive sites exist that ignore the original posters' requests. They're frequently used to archive posts from public figures that people suspect will later be scrubbed.
Hidden. Even when you ask them to remove stuff.
Had a domain, stuff got archived, asked them to remove it, added a robots.txt. Domain lapsed. Someone else picked it up. Their robots.txt is now permissive, and the old stuff I had asked them to remove is visible again.
As far as I've been able to tell, it's just hidden and not deleted.
I have to keep an old domain registered indefinitely, hosting a robots.txt, just to keep hidden the sensitive personal data that a younger me foolishly published on the open internet.
But I'm not complaining. The Internet Archive is a great gift. Using it with a bookmarklet really feels like a superpower.
(I checked and ezboard is still excluded.)
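For anyone who hasn't set one up, the bookmarklet mentioned above can be as simple as a javascript: URL that jumps to the Wayback Machine's capture listing for the current page. This is just one common recipe, not an official Internet Archive feature (swapping the path for /save/ requests a fresh capture instead):

```
javascript:void(location.href = 'https://web.archive.org/web/*/' + location.href)
```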
A search of YouTube for "wayback machine" produces pages of stuff about the Internet Archive, and only the 24th result has anything to do with the origin of the term.
People who didn't spend their Saturday mornings glued to the TV screen as a child of the 1970s might not remember how American kids learned about history back then:
Peabody's Improbable History - Surrender of Cornwallis
Peabody and Sherman travel back to October 19, 1781 to witness Cornwallis surrendering to Washington. However, when they get there, Cornwallis doesn't show up.
https://www.youtube.com/watch?v=3E8zmaOiCVw&ab_channel=bullw...
FYI, it's not all that obscure nowadays. There's a current series, The Mr. Peabody & Sherman Show, on Netflix, and Hollywood made a Mr. Peabody & Sherman movie (by DreamWorks) in 2014.
Without trying to be contrarian, I don't think that everything should be archived. Random tweets, random blog posts, random personal web sites. Let them wither and die and be forgotten. Notable content by notable people? Sure.
Everyone else ought to have the right to be forgotten, including some drunk tweet they wrote 10 years ago and regret, or an old personal page which contained too much PII.
The Archive no longer has a way to opt out, which is bad enough, but I still think it should be opt-in.
Perhaps the Internet Archive could do more to help people who find their personal/sensitive/embarrassing content made available in perpetuity (I’m not sure exactly what they could do), but it’s incredibly valuable to have archiving on by default. The voice of un-notable people is underrepresented in every field of study, and the voice of notable people tends to get preserved in other ways anyway.
That's lovely from the perspective of a historian 200 years in the future. But it does nothing to alleviate the pain of people living in the present. People whose prospective employers comb over their embarrassing past. Or bullies. Or any number of other evildoers whose lives are made easier by unfettered access to indelible information.
Most of the history we know is from the perspective of the wealthiest 0.1%. Even though the Internet is still biased towards the wealthier and more educated, having a history of the wealthiest 10% would be enormous progress.
Historians spend a lot of time poring over minutiae from non-notable people. There is plenty about the world that doesn't seem worth writing down but can be intuited from tangential texts.
From what I understand, present-day historians are sitting on a huge pile of cuneiform tablets that have yet to be transcribed or translated, because there is much more material than there is interest/manpower.
Of course, to your point, they do keep it around. They don't just throw it in the trash.
Who decides what "notable" is? I frequently use the Archive to find old academic grey lit (preprints, lecture notes, newsgroup posts, etc.). Much of it is on "random" blog posts and personal websites. Even the authors aren't usually notable by Wikipedia standards. Yeah, there is some PII on those pages, but there are also treasures of useful information.
For example - the graffiti at Pompeii is interesting (and is pretty much at the same "quality bar" as Twitter):
https://www.theatlantic.com/technology/archive/2016/03/adrie...
https://kashgar.com.au/blogs/history/the-bawdy-graffiti-of-p...
The cost seems low enough to just keep it.
Well, it isn't named the Internet Encyclopedia for a reason.
> Without trying to be contrarian, I don't think that everything should be archived
It isn't contrarian. The deletionists are seemingly the majority. It is contrarian to in fact archive all. the. things.
The Internet Archive location is beautiful. It's a church that has been partially turned into a server farm. Big Neuromancer energy when you go inside and look.
https://www.atlasobscura.com/places/internet-archive-headqua...
A little-known bit of trivia: Apache Hadoop (and the multi-billion-dollar open-source big-data ecosystem it spawned) was initially worked on at the Internet Archive [0].
Speaking of billions: According to Kahle, Alexa Internet's compute infrastructure informed Amazon's take on IaaS (AWS) [1].
Another perhaps-forgotten nugget: Amazon once funded (either in part or in full) the development of the Wayback Machine, the Internet Archive's most impactful product. In addition, to this day (if I'm not mistaken) Amazon continues to donate data it fetches from Alexa Toolbar installations to the Wayback Machine.
[0] https://archive.is/Le3id
[1] https://archive.is/EnzHq
I love the Internet Archive; I worry that its utility will wane as content becomes more dynamic than static. What does it mean to archive the experience of scrolling through a social feed?
The paid, legal-oriented archiving service Perma.cc that Harvard Law runs actually lets you upload your own PDFs and screenshots, in addition to letting Perma.cc's bots capture webpages. Of course, since you could upload anything, the difference is made clear in the UI.
In a legal context, simply attesting to the validity of a screenshot is really common. So when that functionality is used, Perma.cc is operating more as a permanent file-storage service than as a trusted archive.
Regardless, this does go a long way toward solving the problem of dynamic sites.
> actually lets you upload your own PDFs and screenshots in addition to allowing Perma.cc's bots capture webpages
FWIW: the Wayback Machine is just one part of the Internet Archive. The quoted bit accurately describes things you can do with an archive.org account, too. Readers here may be familiar with the archive.org-affiliated effort by a team specifically working to recreate the playability of old PC (and other) video games with JSMESS.
> this does go a long way to solving the problem of dynamic sites
Maybe, but the "dynamic" aspect that I'm sure the other person had in mind doesn't have much to do with the D in DHTML so much as it has to do with the dynamism that arises when you have a smart server responding to requests from a fat, JS-powered frontend. It would be possible to accurately model this in and execute it from a series of static assets, in some cases, but it's rarely done.
Even many sites built with static site generators today are not going to be usable in the future. There's too much tight coupling to the environment/deployment configuration and not enough semantic richness to properly hint to the crawler which resources are necessary to archive. In the heyday of XML, it used to be a big deal to strive for machine-readable documents. Today's resume-driven-development-obsessed webdevs effectively cast a vote of no confidence even in HTML, doing an end-run around it daily and figuratively holding up a middle finger to the Principle of Least Power.
https://www.w3.org/DesignIssues/Principles.html#PLP
To some extent, even a bunch of the projects associated with TBL's Solid initiative are guilty of doing the same.
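To make the JS-rendered-page problem concrete: one common workaround (just an illustration, not a description of the Wayback Machine's own pipeline) is to let a headless browser finish rendering before snapshotting the DOM. A minimal sketch with Playwright on Node 18+, with a placeholder URL:

```ts
// snapshot.ts - render a JS-heavy page before archiving its DOM.
// Assumes the `playwright` package is installed; the target URL is a placeholder.
import { chromium } from "playwright";
import { writeFile } from "node:fs/promises";

async function snapshot(url: string, outFile: string): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // Wait until network activity settles so client-side rendering has a chance to finish.
  await page.goto(url, { waitUntil: "networkidle" });
  const html = await page.content(); // the fully rendered DOM, not the original payload
  await writeFile(outFile, html);
  await browser.close();
}

snapshot("https://example.com/some-js-heavy-page", "snapshot.html").catch(console.error);
```

Of course this only freezes one rendered state of the page; the server-side half of the dynamism - search, pagination, personalization - is exactly the part that can't be captured this way, which is the point above.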
> This library would have all the published works of humankind. This library would be available not only to those who could pay the $1 per minute that LexisNexis charged, or only at the most elite universities. This would be a library available to anybody, anywhere in the world. Could we take the role of a library a step further, so that everyone’s writings could be included–not only those with a New York book contract? Could we build a multimedia archive that contains not only writings, but also songs, recipes, games, and videos?
For every Sci-Hub trying to create the library of Alexandria, there's an Elsevier trying to burn it down.
Current copyright law is largely on the side of the arsonists rather than the archivists.
(note: recipes are not copyrighted, though cookbooks are)
https://archive.org/donate/
Open Library in particular has a very active repo with lots of volunteers, a weekly community call, and a rather accessible codebase. https://github.com/internetarchive/openlibrary
If anyone knows webpack well, we'd LOVE to have this dev-facing issue resolved so CSS auto-reloads: https://github.com/internetarchive/openlibrary/issues/4955
Will send more when I have more, and when I've learned to be more generous. It's good to know that you're near the Internet Archive.
But, oh, what a wonderful feeling
Just to know that you are near
Sets my heart a-reeling
From my toes up to my ears
- Bob Dylan, "The Man in Me"
Internet Archive: https://projects.propublica.org/nonprofits/organizations/943...
Wikimedia Foundation: https://projects.propublica.org/nonprofits/organizations/200...
Mozilla Foundation: https://projects.propublica.org/nonprofits/organizations/200...
Electronic Frontier Foundation: https://projects.propublica.org/nonprofits/organizations/430...