Readit News
luizfelberti · a year ago
> Then a couple of weeks ago, added [direct] links to the Wayback Machine

Hopefully they are also making substantial donations to the Internet Archive, since they will be directing a lot of traffic into it and basically using their infrastructure as a feature on their main product...

EDIT:

Apparently they are collaborating, but there are not many details [0]

[0] https://blog.archive.org/2024/09/11/new-feature-alert-access...

mrkramer · a year ago
>Hopefully they are also making substantial donations to the Internet Archive, since they will be directing a lot of traffic into it and basically using their infrastructure as a feature on their main product

The Wayback Machine link is hidden so deep in the "About the source" page that the vast majority of Google users won't even know that it exists.

There is an excellent browser extension called Web Archives[0] that brings all the major web archiving services (e.g. Archive.is, the Wayback Machine and others) together in one place.

[0] https://github.com/dessant/web-archives
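
For anyone who wants the same lookup without an extension, here is a minimal sketch of building lookup links by hand. It is not how the Web Archives extension is implemented; `archive_lookup_urls` is a hypothetical helper name, and it assumes the publicly visible URL patterns `web.archive.org/web/<timestamp>/<url>` (which, as far as I know, redirects to the capture closest to the timestamp) and `archive.ph/newest/<url>`.

```python
from urllib.parse import quote


def archive_lookup_urls(page_url, timestamp="2024"):
    """Build lookup links for a page on two public web archives.

    Hypothetical helper: the URL patterns are assumptions based on how
    these services expose snapshots in the browser address bar.
    """
    # Escape spaces and similar characters, but keep URL punctuation intact.
    safe_url = quote(page_url, safe=":/?&=%#")
    return {
        # Wayback Machine: capture closest to the (possibly partial) timestamp.
        "wayback": f"https://web.archive.org/web/{timestamp}/{safe_url}",
        # archive.today: most recent snapshot of the page.
        "archive_today": f"https://archive.ph/newest/{safe_url}",
    }


if __name__ == "__main__":
    for name, link in archive_lookup_urls("https://example.com/some page").items():
        print(name, link)
```

Whether a snapshot actually exists behind either link depends on the archive; the sketch only builds the address.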

lelandfe · a year ago
No kidding:

Click a result's three dots menu. Underneath all the main call to action buttons (Visit, share, save) is a Wikipedia description of the site. Underneath that is a "More about this page" button. On this separate page is a description of the company, social media links, reviews, generic results for the company, and, finally, some 1100px down, "See previous versions on Internet Archive's Wayback Machine" in a 14px font: https://imgur.com/a/IMgVDpV

What's the ETA for this being removed due to lack of use...

EasyMark · a year ago
That’s probably a good thing: people who really do research old archived stuff will dig and find it, but others who casually click won’t bring archive.org to its knees.

krackers · a year ago
It'd be absolutely foolish if the agreement wasn't contingent on funding. I assume the reason it's not explicitly stated was some sort of NDA (since IA is also involved in turmoil and Google doesn't want to be part of that).
InDubioProRubio · a year ago
I wouldn't describe IP holders attacking the long-term memory of mankind as turmoil. Digital dementia or IP Alzheimer's seems more fitting.

Give a man a hypothetical infinite amount of meals and he will poison the village well so there will never be fishing again.

karlzt · a year ago
>> NDA

Non-disclosure agreement

gibibit · a year ago
I hope Google is NOT going to be a significant source of funding for the Internet Archive. Because I want to trust Wayback Machine and the Internet Archive to be unbiased.

Google likes to influence search results, hiding ones it doesn't like and elevating those that the company supports. The Wayback Machine has been very reliable so far; I hope it stays that way.

aaroninsf · a year ago
Generally speaking, the Wayback Machine is not searchable in the fashion that Google is, so there isn't a scale to put a thumb on.

(There are some tiny subsets which are rudimentarily full-text searchable; and some efforts to make domains findable. But nothing remotely like even Google 1.0 mapping URIs to organic terms.)

account42 · a year ago
IA needs an alternative - an independent backup archive - more than it needs funding. Unless IA funding exceeds the entire US copyright lobbying industry there is always a chance they will cease to exist without enough notice to save the data somewhere else.

There is also the matter of what IA will be able to archive. With the machine learning gold rush, more and more site operators see dollar bills in front of them and are restricting who can crawl their content. Google is in a special position here because almost no one can afford not to be crawled by Google, which is what made their cache especially valuable in addition to the IA.
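
To make the crawl-restriction point concrete, here is a minimal sketch of a robots.txt that admits Googlebot and turns everyone else away, checked with Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleArchiveBot` name and the `allowed` helper are made up for illustration; real sites vary, and many enforce blocks by other means than robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, everyone else nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "ExampleArchiveBot"):
        print(bot, allowed(bot, "https://example.com/article"))
```

Under a policy like this, a well-behaved archive crawler never sees the page, while Google still does.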

runxel · a year ago
Very sad to see it gone. It was always some kind of last resort. Internet Archive is lovely, don't get me wrong, but it relies mostly on people actively queueing up sites to save.

So most of the time, for more obscure sites where the bitrot had already set in and they weren't loading anymore, you could use the Google cache to get something out of it, where IA had nothing.

DaoVeles · a year ago
I do worry about the future of IA. Simply because of some of their reckless moves with their book lending policy, they have opened themselves up to being bled dry financially. That, plus the amount of copyright infringement openly available on the site, is just waiting to be attacked.

I am waiting for Nintendo to get wind of the huge ROM dumps on there; it is not going to be pretty. No amount of 'moral high ground' will defend against lawyers.

Gud · a year ago
I disagree. I am happy the Internet Archive is fighting the draconian copyright laws that exist.
iamleppert · a year ago
Google Cache was useful because sometimes you could not find a term or keyword on the website, but it would be in the cache. It also helped with sites that had gone offline or no longer had the item. "It's still in the Google Cache!" You can't say that anymore.

I use Google less and less these days. What's the point when you can just ask an LLM, and it gives you an answer within seconds, with no ads? You can ask for references and links and it will give those to you too. I don't think I've ever been given a link to an SEO content farm, whereas with Google search it's the entire page. Google Search feels like Yahoo did (maybe even worse) right before it died and was replaced with Bing.

deanCommie · a year ago
This still happens all the time.

* I search a keyword
* I see a Google result
* I see the keyword IN THE PREVIEW on Google
* I click on the link
* No keyword

And this isn't hidden SEO spam stuff, it was literally removed. The cache doesn't match the live result.

No recourse.

iggldiggl · a year ago
Another annoying scenario is when the search result isn't the actual page/article/… itself, but only a snippet within a site's own (paginated) index.

At the time Google had indexed that page, the article preview you were looking for was maybe on page 5 of that index, but by the time you're arriving, it might have moved to page 11 because of all the additional content that got added since then.

With online shops it's even worse, because there items get both added and possibly removed again, plus the default ordering usually isn't strictly chronological but based on popularity or some other algorithm, so something that originally was indexed on page 5 of the catalogue might by now be on page 2 or on page 12, or it might have been dropped from the inventory altogether.
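
The drift in a newest-first index is simple arithmetic. A small sketch, assuming a made-up page size of 10 and hypothetical function names, shows how an entry indexed on page 5 ends up on page 11 once 60 newer entries have been added in front of it:

```python
def page_of(position, page_size):
    """1-based page number for a 0-based position in a paginated list."""
    return position // page_size + 1


def page_after_growth(position, new_items_in_front, page_size):
    """Page number once newer items have pushed the entry further back."""
    return page_of(position + new_items_in_front, page_size)


if __name__ == "__main__":
    # Assumed numbers: entry at position 45, 10 entries per page.
    print(page_of(45, 10))                # page at indexing time
    print(page_after_growth(45, 60, 10))  # page after 60 newer entries
```

With popularity-based ordering in a shop there is no such formula, which is why the cached copy was the only reliable way to see what the crawler saw.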

EasyMark · a year ago
LLM… no ads…. *For now
account42 · a year ago
*That you notice.

Assuming that the output isn't biased towards the operator's interests is naive.

AbstractH24 · a year ago
I’m honestly less worried about ads than I am about paid preferential treatment
neop1x · a year ago
Google's index sometimes also contains content that sits behind a paywall or cookie wall. Two major sites in Czechia started implementing cookie walls, which is against GDPR, but our local office for data privacy is not acting, so it seems they are probably paid by those websites...
cyberax · a year ago
I used the cache a lot, not just to view sites, but also to see the text versions of PDF and Word documents. RIP.
bjord · a year ago
oh, wow, same! this comment just made me realize that some of my older projects will no longer work after this
ThinkBeat · a year ago
I would presume Google still has all this data. They just will not let anyone else use it.

Could this be an advantage that Google can use to train their models on but others won't have access?

Google wants it to be more difficult to notice rewrites? Journalists have too often found valuable information with it?

selectodude · a year ago
I feel like the Internet Archive has taken a lot of that sort of use off of Google.

Unrelated: Google should probably think about a sizable donation to the Internet Archive.

amorfusblob · a year ago
Some kind of collaboration appears to be happening between the two https://blog.archive.org/2024/09/11/new-feature-alert-access...
qingcharles · a year ago
They should donate all their saved data from Google Cache too.
zepearl · a year ago
> I would presume Google still has all this data. ...

Maybe - I guess that they must have served that "cached" content from DB-records that had it all saved directly (URL X has contents Y => basically a "mirror" of the terms that they indexed) => not having to store that "mirror" (only the search index) might save quite a lot of storage space (and I/O and CPU to decompress it, as users won't be requesting it anymore) => all in all that might save quite a lot of infrastructure costs $$$.

> Could this be an advantage that Google can use to train their models on but others won't have access?

Maybe (if they decided to just get rid of the I/O related to the user requests), but on the other hand I don't know if previously any "Google-consumer" was ever able to perform mass-downloads of Google's "cached" data - could that be done without being banned by Google's webpage (or API)?

advisedwang · a year ago
As I understand it, Google does a decent amount of rendering of a page before indexing; this a) allows it to index content loaded by JS and b) prevents some ways spammers show Google different content from users. Perhaps Google's main way of storing a page no longer matches something that can be easily served as a cache page. This might be a way to remove a legacy copy of each page and reduce storage costs.
account42 · a year ago
> prevents some ways spammers show Google different content from users.

Google obviously hasn't cared about that for a long time.

lofaszvanitt · a year ago
Just as with YouTube, the surface area of these services is getting smaller and smaller and you get less and less. Too much optimization to the detriment of users. All the while, search is still rooted in 90s concepts and only serves as a money-making thing.
bigstrat2003 · a year ago
I am genuinely surprised to learn that it even still existed. I'm pretty sure it's been years since I have seen a Google result which actually had a cached version for me to pull up.
JonChesterfield · a year ago
One fewer reason to use Google search. Solid effort killing the money printer all around.
karlzt · a year ago
One more reason not to use Google search. I don't remember when I last used it, perhaps twelve years ago.
silverliver · a year ago
Do any other search engines provide cached versions of the pages they crawl? Far too many sites perform shenanigans based on GeoIP/UA.

Yandex, DuckDuckGo, and BraveSearch, please cache the pages you crawl and make them available to your users.

sandyarmstrong · a year ago
This was really useful when looking for product support, as companies regularly pull down or move around pages on their website. Viewing the version of a page as it was when Google indexed it as a result was something I did all the time.