I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, repeated keywords, and a focus on "indexability" instead of readability made the web a less-than-ideal source for such analysis long before LLMs.
It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
> I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, repeated keywords, and a focus on "indexability" instead of readability made the web a less-than-ideal source for such analysis long before LLMs.
Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.
In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.
SEO text carefully tuned to tf-idf metrics and keyword-stuffed to the empirically determined threshold Google just allows should have unnatural word frequencies.
LLM content should just enhance and cement the status quo word frequencies.
Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
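The feedback loop these comments describe is easy to sketch. Here is a toy simulation (corpus, numbers, and function names are invented for illustration, not from wordfreq): each "generation" samples a new corpus from the previous generation's frequency estimates, which is roughly what counting frequencies over model output amounts to.

```python
import random
from collections import Counter


def frequencies(words):
    """Relative word frequencies of a corpus."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# A toy "human-written" corpus (made up for illustration).
corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "purred"]
human_freq = frequencies(corpus)

# Each generation, a "model" trained on the previous corpus emits new text
# by sampling from its own distribution; rare words drift toward extinction.
random.seed(0)
freq = human_freq
for generation in range(5):
    words, weights = zip(*freq.items())
    corpus = random.choices(words, weights=weights, k=200)
    freq = frequencies(corpus)

# The result is still a valid distribution over the original vocabulary,
# but its shape now reflects the sampler, not any human writer.
print(sorted(freq.items(), key=lambda kv: -kv[1]))
```

Measuring the final `freq` tells you about the resampling process, which is the "serpent eating its own tail" point in miniature.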
At some point, though, you have to acknowledge that a specific use of language belongs to the medium through which you're counting word frequencies. There are also specific writing styles (including sentence/paragraph sizes, unnecessary repetitions, focusing on metrics other than readability) associated with newspapers, novels, e-mails to your boss, anything really. As long as text was written by a human who was counting on at least some remote possibility that another human might read it, this is a far more legitimate use of language than just generating it with a machine.
This feels like a second, magnitudes larger Eternal September. I wonder how much more of this the Internet can take before everyone just abandons it entirely. My usage is notably lower than it was in even 2018, it's so goddamn hard to find anything worth reading anymore (which is why I spend so much damn time here, tbh).
I think it's an arms race, but it's an open question who wins.
For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).
Yes, but not quite as far as you imply. The training data is weighted by a quality metric: articles written by journalists and Wikipedia contributors are given more weight than Aunt May's brownie recipe and corpo-blogspam.
> The training data is weighted by a quality metric
At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI-generated thumbnail and an AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation, since the point is to get the viewer to click an affiliate link.
Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.
Aunt May's brownie recipe (or at least her thoughts on it) is likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.
Prior to Google we had Altavista and in those days it was incredibly common to find keywords spammed hundreds of times in white text on a white background in the footer of a page. SEO spam is not new, it's just different.
Don't forget Google's AdSense rules, which penalized useful, straightforward websites and mandated that websites be full of "content". Doesn't matter if the "content" is garbage nonsense, rambling, and excessive word use; it's content, and much more likely to be okayed by AdSense!
It's crazy to attribute the downfall of the web/search to Google. What does Google have to do with all the genuine open web content, Google's source of wealth, getting starved by (increasingly) walled gardens like Facebook, Reddit, Discord?
I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.
We do know that the open web constitutes the bulk of the training data, although we don't get to know the specific webpages that got used. Plus some more selected sources, like books, of which, again, we only know that they are books, not which books were used. So it's just a matter of probability that there was a good amount of SEO spam in there as well.
I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.
Congratulations on "shipping", I've had a background task to create pretty much exactly this site for a while. What is your cutoff date? I made this handy list, in research for mine:
2017: Invention of transformer architecture
June 2018: GPT-1
February 2019: GPT-2
June 2020: GPT-3
March 2022: GPT-3.5
November 2022: ChatGPT
You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.
That's exactly the opposite of what the author wanted, IMO. The author no longer wants to be part of this mess. Aggregating these sources would just make it much easier for the tech giants to scrape more data.
The sources are just aggregated. The source doesn't change.
The new stuff generated does (and this is honestly already captured).
This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can no longer be discerned reliably, and thus the project can't continue.
The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.
Making resources like wordfreq more visible won't exacerbate any of these concerns.
Yeah, pay an illustrator if this is important to you.
I see a lot of people who are upset about AI still using AI image generation, because it's not in their field, so they feel less strongly about it, and they can't create art themselves anyway. That's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.
:'( I thought I was clever for realising this parallel myself! Guess it's more obvious than I thought.
Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.
I regret that the situation led the OP to feel discouraged about the NLP community, to which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past it, even?).
The complaint about pollution of the Web with artificial content is timely, and it's not even the first time: think of the spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").
Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.
When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.
Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, Reddit, Facebook, GeoCities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own success and get taken over by spam.
It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.
I don't know; individually fine-tuned addictive content, served as a real-time interactive feedback loop, is a different level of propaganda and attention-capture tool than lowest-common-denominator content served to the general crowd as static, passive material.
Tangentially related, but Marx also predicted, back in 1894, that crypto and NFTs would exist [1], and I only bring it up because it's kind of wild how we keep crossing these "red lines" without even blinking. It's like that meme:
Sci-fi author:
I created the Torment Nexus to serve as a cautionary tale...
Tech Company:
Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"
I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.
As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
On Amazon, you used to be able to search the reviews and Q&A section via a search box. This was immensely useful. Now, that search box first routes your search to an LLM, which makes you wait 10-15 seconds while it searches for you. Then it presents its unhelpful summary, saying "some reviews said such and such", and I can finally click the button to show me the actual reviews and questions with the term I searched.
This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to do a direct search, please tell me.
I used to be able to search for, say, "Trek bike derailleur hanger" and the first result would be what I wanted. Now I have to scroll past 5 ads to buy a new bike, one that's a broken link to a third party, and, if I'm really lucky, at the bottom of page 1 will be the link to that part's page.
Sounds like your laptop is wholly out of date, you need to buy the next generation of laptops on Amazon that can handle the modern SEO load. I recommend the:
LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business
Can vouch for this. It’s the first non-Google search alternative I’ve used that has 100% replaced Google. I don’t need Google as a fallback like I did with others.
I've been slowly detaching myself from the web for the past 10 years. These days I mostly build offline apps using native technologies. Those capabilities are still around. They just receded for a while because they'd gotten so polluted with toolbars and malware. But now the malware is on the other side, and native apps are cool again. If you know where to look. Here's my shingle: https://akkartik.name/freewheeling-apps
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.
> Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1-800 number with no real-person alternative to the decision tree?
I suppose these are just Amazon problems. I have never lived in an area where Amazon is prevalent. Where I live, search engines still can't find synonyms or process misspellings.
"I don't think anyone has reliable information about post-2021 language usage by humans."
We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.
Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.
The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly, at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homo sapiens is around 300,000 years old. Cities, agriculture, and civilization are around 10,000. Metal is around 4,000. The industrial revolution is 500. Democracy? 200. Computation? 50-100.
The revolutions shorten in time, seemingly exponentially.
Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
Daily income per person:
$2.48 Eastern and Southern Africa (PIP)
$2.78 Sub-Saharan Africa (PIP)
$3.22 Western and Central Africa (PIP)
$3.72 India (rural)
$4.22 South Asia (PIP)
$4.60 India (urban)
$5.40 Indonesia (rural)
$6.54 Indonesia (urban)
$7.50 Middle East and North Africa (PIP)
$8.05 China (rural)
$10.00 East Asia and Pacific (PIP)
$11.60 Latin America and the Caribbean (PIP)
$12.52 China (urban)
And more generally:
$7.75 World
I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income of at least half the population of our planet.
100W+ chargers are one of the products I prefer to spend a little more on, so I get something from a company that knows it can be sued if it makes a product that burns down your house or fries your phone.
Flashlights? Sure, bring on aliexpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!
> One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).
For some reason, your comment got me thinking of a fully automated system, like: you go to a website, pick and choose charger capabilities (ports, does it have a battery, that sort of stuff). Then an automated factory makes you a bespoke device (software picks an appropriate shell, regulators, etc.). I bet we'll see it in our lifetimes at least.
Democracy (and republics) are thousands of years old. Computation is also quite old, though it only skyrocketed with electricity and semiconductors. This is not the first time the global world created a potential for exponential growth (I'd consider the Pharaohs' and Roman empires to be such).
There is the very real possibility that everything just stalls and plateaus where we are. You know, like our population growth: it should have grown exponentially, but it did not. Actually, quite the reverse.
> When I get asked if the person in a video is real, I still feel pretty confident to answer
I don't share your confidence in identifying real people anymore.
I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.
There is a video floating around with pairs of young ladies with "This is real"/"This is not real" on signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seem a little "off" but are consistent with the small number of "influencer" videos I have exposure to.
That is very true, but for now we have a baseline of videos that we either remember or that we remember key details of, like the persons in the video. I'm pretty sure if I watch The Primeagen or Tom Scott today, that they are real. Ask me in a year, I might not be so sure anymore.
It's even worse than that. Most people have no idea how far CGI has come, and how easily it is wielded even by a couple of dedicated teens on their home computer, let alone people with a vested interest in faking something for some financial reason. People think they know what a "special effect" looks like, and for the most part, people are wrong. They know what CGI being used to create something obviously impossible, like a dinosaur stomping through a city, looks like. They have no idea how easy a lot of stuff is to fake already. AI just adds to what is already there.

Heck, to some extent it has caused scammers to overreach, with things like obviously fake Elon Musk videos on YouTube generated from (pure) AI and text-to-speech... when with just a little more learning, practice, and an amount of equipment completely reasonable for one person to obtain, they could have done a much better fake of Elon Musk using special-effects techniques rather than shoveling text into an AI. The fact that "shoveling text into an AI" may in another few years itself generate immaculate videos is more a bonus than a fundamental change of capability.
Even what's free & open source in the special effects community is astonishing lately.
I mean, it's already apparent to me that a lot of people don't have a basic process in place to detect fact from fiction. And it's definitely not always easy, but when I hear some of the dumbest conspiracy theories known to man actually get traction in our media, political figures, and society at large, I just have to shake my head and laugh to keep from crying. I'm constantly reminded of my favorite saying, "people who believe in conspiracy theories have never been a project manager."
Oh they definitely are. A lot of people are now calling out real photos as fake. I frequently get into stupid Instagram political arguments and a lot of times they come back with "yeah nice profile with all your AI art haha". It's all real high quality photography. Honestly, I don't think the avg person can tell anymore.
This video's worth a watch if you want to get a sense of the current state of things. Despite the (deliberately) clickbait title, the video itself is pretty even-handed.
It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"
I take issue with this statement, as content was never a clean representation of human actions or even thought. It was always driven by editorial pressure, SEO, bot remixing, and whatnot, which heavily influence how we produce content. One might even argue that heightened content distrust is _good_ for our society.
> Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing.
Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.
Somewhat related: paper books from before 2020 could be a valuable commodity in a decade or two, when the Internet will be full of slop and even contemporary paper books will be treated with suspicion. And there will be human talking heads posing as the authors of books written by very smart AIs. God, why are we doing this????
On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher, just 'cause: there's a lot less crap on gopher (and no, gopher is not the answer).
But...
A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.
The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).
Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.
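For the curious, the kind of invocation involved might look something like this. To be clear, the filenames and settings are assumptions for illustration, not the commenter's actual command, and this targets a player that accepts H.264 files (as in their case); a proper DVD-Video disc would need MPEG-2 instead.

```shell
# Hypothetical sketch: downscale to PAL DVD resolution and encode with
# conservative x264 settings that old hardware is more likely to accept.
# (Real HDR-to-SDR tone mapping would need an extra filter chain, e.g.
# ffmpeg's zscale/tonemap filters; it is omitted here for brevity.)
ffmpeg -i dance_4k_hdr.mp4 \
    -vf "scale=720:576,format=yuv420p" \
    -c:v libx264 -profile:v baseline -level 3.0 -preset slow -crf 18 \
    -c:a aac -b:a 192k \
    dance_dvd.mp4
```

The baseline profile and low level cap are the kind of "finicky old player" concessions the thread is talking about: newer x264 defaults produce streams many ancient decoders simply refuse.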
Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance, which apparently brought a massive smile to his face.
Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
> Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance
Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM, or Handbrake - granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97 year-old eyes)
Back in the day they did. I checked handbrake but now there's nothing specific about DVD compatibility there. I could have picked something like Super HQ 576p, and there's a good chance that would have sufficed, but old DVD players were extremely finicky about filenames, extensions, interlacing, etc. I didn't want to risk the DVD traveling half way across the world only to find that it's not playable.
GOGI.
The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.
Following that progression, the third iteration, then, is naturally LLMs writing for corporate bots: neither for humans nor for other LLMs.
How do we know what content LLMs were fed? Isn't that a highly guarded secret?
Won't the quality of the training content be paramount to the quality of the generated output, or does it not work that way?
Their research and projects are great.
You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.
IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.
1. build a userbase, free product
2. once the userbase gets big enough, any new account requires a monthly fee, maybe $1
3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.
no ads, simple.
The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.
Even if so, this is a dangerous thought: it discourages the decisive action that is likely to be necessary for that to happen.
1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm
As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to do a direct search, please tell me.
Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...
Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/
This seems to bypass all of the LLM stuff for now.
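The URL trick above can be sketched as a tiny helper. This is a heuristic based purely on the URL patterns shown in the comment, not any official Amazon API: it assumes the product identifier is a 10-character ASIN sitting after /dp/, /gp/product/, or /product-reviews/ in the URL (the example product URL below is hypothetical, built around the ASIN from the review link).

```python
import re

def review_url(product_url: str) -> str:
    """Build a direct review-page URL from an Amazon product URL.

    Heuristic: look for a 10-character ASIN after /dp/, /gp/product/,
    or /product-reviews/; fall back to any ASIN-shaped path segment.
    """
    m = re.search(r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})", product_url)
    if not m:
        # Fallback: any 10-char alphanumeric path segment that looks like an ASIN
        m = re.search(r"/([A-Z0-9]{10})(?:[/?]|$)", product_url)
    if not m:
        raise ValueError("no ASIN found in URL")
    return f"https://www.amazon.com/product-reviews/{m.group(1)}/"

# Hypothetical product URL containing the ASIN from the comment above
print(review_url("https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy/dp/B001OF5F1E"))
# → https://www.amazon.com/product-reviews/B001OF5F1E/
```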
I used to be able to search for "Trek bike derailleur hanger" and the first result would be what I wanted. Now I have to scroll past five ads trying to sell me a whole new bike, one result that's a broken link to a third party, and, if I'm really lucky, at the bottom of page 1 there will be a link to that part's page.
The shitification of the web is real.
(The Agner Fog of cycling?)
LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business
Name rolls off the tongue doesn’t it
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.
To get to the milk you'll have to walk by 3 rows of chips and soda.
Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1-800 number with no real-person alternative to the decision tree?
We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.
Smaller children especially don't have a good intuition for what is real and what is not. When I get asked whether the person in a video is real, I still feel pretty confident answering, but I get less and less confident every day.
The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
https://www.nytimes.com/interactive/2024/09/09/technology/ai...
https://www.nytimes.com/interactive/2024/01/19/technology/ar...
These are a little unfair, in that we're comparing handpicked examples, but I don't think many experts would pass a test like this. Technology only moves forward (and seemingly at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homo sapiens are around 300,000 years old. Cities, agriculture, and civilization are around 10,000. Metalworking is around 4,000. The industrial revolution is around 500. Democracy? 200. Computation? 50-100.
The revolutions shorten in time, seemingly exponentially.
Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
I'm not quite sure where this is all headed.
It really isn't. Have a look at daily median income statistics for the rest of the planet:
https://ourworldindata.org/grapher/daily-median-income?tab=t...
And more generally: I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income for at least half the population of our planet. Flashlights? Sure, bring on AliExpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!
Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).
For some reason, your comment got me thinking of a fully automated system, like: you go to a website and pick and choose charger capabilities (ports, whether it has a battery, that sort of stuff). Then an automated factory makes you a bespoke device (software picks an appropriate shell, regulators, etc.). I bet we'll see it in our lifetimes at least.
There is the very real possibility that everything just stalls and plateaus where we are. You know, like our population growth: it should have gone exponential, but it did not. Actually, quite the reverse.
Progress isn't inevitable. It's possible for knowledge to be lost and for civilization to regress.
The Technological Singularity - https://en.wikipedia.org/wiki/Technological_singularity
I don't share your confidence in identifying real people anymore.
I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.
There is a video floating around with pairs of young ladies holding "This is real"/"This is not real" signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seem a little "off" but are consistent with the small number of "influencer" videos I have exposure to.
I don't. I mean, I can identify the bad ones, sure, but how do I know I'm not getting fooled by the good ones?
I see a lot of outrage around fake posts already. People want to believe bad things from the other tribes.
And we are going to feed them with it, endlessly.
Even what's free & open source in the special effects community is astonishing lately.
And it already happened, and no one pushed back while it was happening.
It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"
https://youtu.be/XeQ-y5QFdB4
Fair and accurate. In the best case, the person running the model didn't write this stuff, and the word salad doesn't communicate whatever they meant to say. In many cases, though, content is simply pumped out for SEO with no intention of being valuable to anyone.
On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher, just because there's a lot less crap on gopher (and no, gopher is not the answer).
But...
A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.
The thing about x264 is, it doesn't have any docs. Unlike x265, which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently from how they did 20 years ago. I could spend hours going through dozens of 20-year-old threads on doom9 to figure out what each flag did, or I could do what I did and ask an LLM (in this case Claude).
Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.
Back in pre-LLM days, it's not like I would have hired an x264 expert to do this job for me. I would either have had to spend hours more on this task, or, more likely, this 97-year-old man would never have seen his great-granddaughter's dance, which apparently brought a massive smile to his face.
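For anyone facing the same task: the final command probably had roughly this shape (a sketch only, with hypothetical filenames, not the commenter's actual invocation; `-target ntsc-dvd` tells ffmpeg to produce DVD-Video-compliant MPEG-2, and the zscale/tonemap filter chain maps the HDR source down to SDR):

```shell
# Sketch: tonemap a 4K HDR source to SDR and encode as NTSC DVD-compliant
# MPEG-2. zscale requires an ffmpeg build with libzimg; the tonemap
# parameters (hable curve, npl=100) are reasonable defaults that would
# still need tuning per source. Swap ntsc-dvd for pal-dvd in PAL regions.
ffmpeg -i dance_4k_hdr.mp4 \
  -vf "zscale=t=linear:npl=100,tonemap=hable,zscale=p=bt709:t=bt709:m=bt709:r=tv,format=yuv420p" \
  -target ntsc-dvd dance_dvd.mpg
```

Note that `-target ntsc-dvd` also sets the resolution, frame rate, and bitrate caps the DVD spec requires, which is why no explicit scale filter is needed.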
Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM or Handbrake. Granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97-year-old eyes).