> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward
I find this to be a particularly interesting problem in this whole debacle.
Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.
I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except that the API it used didn't actually exist.
Now imagine a myriad of fake blogs using ChatGPT under the hood to generate entries explaining how to solve commonly searched-for problems, which then get spidered and fed into ChatGPT 2.0. Such things could end up creating a downwards trend in quality, as more and more of such junk gets posted, absorbed into the model and amplified further.
I think image generation should be less vulnerable to this, since images need tagging to be useful ("ai generated" is a common tag that can be used to exclude old outputs from reingestion), and because precision doesn't matter so much with artwork. If people like the results, then it doesn't matter that much that something isn't drawn realistically.
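As a rough sketch of the kind of tag filter I have in mind (the record format and the tag list are made up here, not from any real pipeline):

    # Hypothetical sketch: drop anything tagged as AI-generated before retraining.
    # `records` is assumed to be a list of dicts like {"url": ..., "tags": [...]}.
    AI_TAGS = {"ai generated", "ai-generated", "stable diffusion", "midjourney"}

    def looks_ai_generated(record):
        return any(tag.lower() in AI_TAGS for tag in record.get("tags", []))

    def filter_training_set(records):
        return [r for r in records if not looks_ai_generated(r)]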
> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.
This is exactly what I don't like about Copilot, maybe even more than the IP ethics of it. If it really succeeds, it's going to have a feedback loop that amplifies its own code suggestions. The same boilerplate-ish kind of code that developers generate over and over will get calcified, even if it's suboptimal or buggy. If we really need robots to help write our code, to me that means we don't yet have expressive enough languages.
But enterprise FizzBuzz is a demonstration that exactly that phenomenon will happen even without AI, merely with books, YouTube videos, or blogs (the calcification) and cargo culting (the lazy application).
As someone in SEO, I've been pretty disgusted by site owners' desire to use AI-generated content. There are various opinions on this, of course, but I got into SEO out of interest in the "organic web" vs. everything being driven by ads.
Love the idea of having AI-Free declarations of content as it could / should help to differentiate organic content from generated content. It would be very interesting if companies and site owners wished to self-certify their site as organic with something like an /ai-free.txt.
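There is no standard for this today, but purely as a sketch, a self-certification file could look a lot like robots.txt does; every field name below is invented:

    # /ai-free.txt (hypothetical format, no such standard exists yet)
    version: 0.1
    scope: /blog/*
    declaration: human-authored
    assistive-tools: spellchecker, grammar-checker
    contact: webmaster@example.com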
Curious, isn't SEO the thing that ruined search to a big extent? A 1000-word article where the actual answer needs a fraction of that size. Or interesting content buried because it is not "SEO optimized". Or companies writing blog content and making it look helpful while actually shilling and highlighting their own product. Plus tons of other things.
So now you need something like ChatGPT to cut through the noise?
I don't see the point. There's lots of old content out there that won't get tagged, so lacking the tag doesn't mean it's AI generated. Meanwhile people abusing AI for profit (e.g., generating AI-driven blogs to stick ads on them) wouldn't want to tag their sites in a way that might get them ignored.
>but I got into SEO out of interest in the "organic web" vs. everything being driven by ads.
If the owner of an SEO site wants to use AI for "content generation", doesn't that mean they didn't care about the human-generated content in the first place?
Seems like a choice between garbage and slightly more expensive garbage. What is interesting or organic about that? Back in the day, people used to put things on their websites because they cared about it and wanted to say it.
Don't worry, it won't be small shops doing it. It will be the majors, if that's where the money is.
To quote Yann LeCun:
Meta will be able to help small businesses promote themselves by automatically producing media that promote a brand, he offered.
"There's something like 12 million shops that advertise on Facebook, and most of them are mom and pop shops, and they just don't have the resources to design a new, nicely designed ad," observed LeCun. "So for them, generative art could help a lot."
How could it possibly help unless there were some independent verification mechanism though? If there's a motivation to lie about the content being "organically generated" because that's what search users prefer to find, then clearly people will.
And it's hard to imagine what that verification process would look like given current technology.
What about AI-assisted writing? e.g. improving style, grammar, readability, making explanations clearer and better structured? especially for non-native writers this is a challenge and not many can hire an editor or even a proofreader. I wonder if such use gets “penalized” by search engines the same way AI-generated content might?
> I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except that the API it used didn't actually exist.
So the next question has to be: Was this still the right answer?
I've personally had plenty of instances in my programming career where the code I was working on really needed functions which were best shopped out to a common API.
To avoid interrupting my flow and to better inform the API I'd be designing for this, I just continued to write as if the API did exist. Then I went on to implement the functions that were needed.
Perhaps the bot was right to presume that there should be an API for this. You might even be able to then prompt ChatGPT to create each of the functions in that API.
Exactly, that there is an end to the rabbit hole is a limitation of today's models.
If something does not exist, it should be generated on the spot. GPT5 should check for the existence of an API and if it exists, test and validate it. If it fails tests or doesn't exist, create it.
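As a sketch of that loop (everything here is hypothetical; the callables stand in for an LLM, an API registry lookup, and a test harness, none of which exist in this form today):

    # Rough, speculative sketch of "check, validate, otherwise create".
    def resolve_api(description, generate, api_exists, run_tests):
        candidate = generate(f"Suggest an API for: {description}")
        if api_exists(candidate) and run_tests(candidate):
            return candidate  # an existing API that validates: use it
        # Missing or failing tests: ask the model to implement it on the spot.
        implementation = generate(f"Implement this from scratch: {candidate}")
        return implementation if run_tests(implementation) else None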
Well, this is ChatGPT, not Copilot, so I'd assume that OP was looking for a snippet using a public library rather than an internal API. In that context, suggesting you use an API that doesn't exist is just wrong.
I've definitely done this with Copilot, though—it will suggest an API that doesn't actually exist but logically should in order to be consistent, and I'll go create it.
>I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except that the API it used didn't actually exist.
Yes! I remember generating a seemingly reasonable R script except that the library that it called to do most of the work didn't exist! It was like code from an alternate dimension!
I asked if there were any open source libraries that implemented a certain algorithm. It gave me links to 3 different GitHub repos, none of which existed.
Can confirm that this happened to me when I asked ChatGPT to generate a parallax effect in Qt/QML. It simply used a QML element with the name Parallax.
Yeah, a few times when I ask for a reference to something outlandish, it generates a perfectly realistic-looking paper alongside a DOI link that's completely made up. Neither the paper nor the link exists!
> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward
Every discussion on AI takes ChatGPT and its inherent flaws as the example, but AI-generated content doesn't have to be dull and low quality.
One question that bothers me is: does it really matter? If AI-generated content is on par with human-made content, or even better, does it matter anymore that an AI generated it?
Maybe it's the sentimental value, empathy, fidelity?
If an AI had written Mozart's Requiem would it lessen its interest, its beauty?
I don't think AI has to be low-quality for GP's concern to be valid.
Humans get inputs from a large variety of sources, but if an AI's input is just text, then there's the potential for AI's input to mostly consist of its prior output. Iterate this, and its model could gradually diverge from the real world.
The equivalent in human society is groupthink, where members of a subgroup get most of their information from the same subgroup and end up believing weird things. We can counter that by purposely getting inputs from outside of our group. For a text-model AI, this means identifying text that wasn't produced by AI, as the article suggests.
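A toy way to see the feedback-loop worry (this is only an analogy, not a claim about how any real LLM is trained): repeatedly fit a simple model to samples drawn from the previous fit and watch it drift and lose diversity.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=1.0, size=200)   # stand-in for "human" data
    mu, sigma = data.mean(), data.std()

    for generation in range(1, 31):
        data = rng.normal(mu, sigma, size=200)        # the model's own output
        mu, sigma = data.mean(), data.std()           # "retrain" on that output
        if generation % 10 == 0:
            print(f"gen {generation}: mean={mu:+.3f}, std={sigma:.3f}")
    # The std tends to shrink and the mean wanders: each generation narrows in
    # on the previous one's quirks instead of the original distribution.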
> Every discussion on AI takes ChatGPT and its inherent flaws as the example, but AI-generated content doesn't have to be dull and low quality.
To get away from that we'd have to dramatically change our approach. The LLMs we have are trained on as much content as possible and essentially average out the style of their training data. What it writes reads like a B-grade high school essay because that is what you get when you average all the writing on the internet.
It's not obvious to me that a creative approach that boils down to "pick the most likely next word given the context so far" can avoid sounding bland.
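Stripped of everything that makes real LLMs interesting, the greedy version of that idea looks like this toy bigram model (not how GPT actually decodes, which samples from a distribution rather than always taking the top word):

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat and the cat slept".split()
    nxt = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        nxt[a][b] += 1                            # count next-word frequencies

    word, out = "the", ["the"]
    for _ in range(5):
        if not nxt[word]:
            break
        word = nxt[word].most_common(1)[0][0]     # always pick the likeliest word
        out.append(word)
    print(" ".join(out))                          # greedy decoding: safe and bland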
>One question that bothers me is: does it really matter? If AI-generated content is on par with human-made content, or even better, does it matter anymore that an AI generated it?
> If an AI had written Mozart's Requiem would it lessen its interest, its beauty?
I think it's about intent. Art is interesting and beautiful to us because there is an undeniable human intent in creating it and a vision behind it.
ChatGPT and DALL-E are pretty cool but I think until AI gets its own intent and goals it's pretty fair to try to separate human art and AI art.
Yes! There’s a common problem where people think that an ecosystem is infinite, or at least sufficiently large, when it’s not. We’ve done similar with dumping in the ocean, and now we’ve all got plastics in our blood, and we assumed soil quality was a given over time, too. AI content released into the wild will be consumed by AI; how can it not be? You’ve got a system which can produce content at a rate several orders of magnitude higher than a human, of course the content ecosystem will be dominated by AI-generated content, so of course the quality of content generated by AI systems, which rely on non-AI-generated content to train, will go down over time.
I feel like we are nearing peak "uncurated" content, both for humans and machines. Humans are still grappling with our novel abundance problems.
As we move forward, I suspect we will see an increase in curation services and AI models will do more with less. You can bootstrap a productive adult human on an almost infinitesimal slice of the training sets we are using for the current gen of AI; I can't imagine future approaches are going to need such unbounded input to get better results - but I might be wrong!
If content is curated for its quality, whether or not it's AI generated (or assisted) doesn't matter.
We’d have to adjust capitalism to deal with the “novel abundance” problems. Most of the drive for novel content/audiences is simply to decide which people get a cut of the revenue (and/or audience).
If we focused on quality and stopped caring about who gets paid for what I suspect that not only would we have better quality overall but we’d also push the boundaries much faster thus making things even more interesting.
I think that in addition to influencing future models, AI content will also influence how humans think and write. People will start ironically and unironically copying GPT's style in their own writing, causing human-produced content to increasingly resemble AI content.
High school students that are prohibited from using AI for their essays will have a bad time. Even if they don't use AI chatbots themselves, they will unknowingly cite sources that were written by AI, or were written by someone who learned about the topic by asking ChatGPT.
Hmm, forgetting natural language for a moment and instead considering programming languages: it’s pretty easy to generate nonsense but semi-plausible looking ASTs without the help of AI. Could this be used to attack GitHub’s copilot?
Step 1. Release a tool that generates nonsense code across a thousand repositories, and allow anybody to publish crap to GitHub.
Step 2. Copilot trains on those nonsensical repositories because it can’t distinguish them from the real thing.
Step 3. Copilot begins to generate similar crap.
Imagine this as a security attack vector. Instead of nonsense, spam a bunch of repos with code that does a specific thing but in a very hard to understand way. Then add in a small piece of very hard to understand, but legit looking malicious code. Copilot trains on it and then starts feeding it to developers around the world. Probably easier ways to achieve this, but interesting to think about.
Those blogs already exist. Pretty much 90% of the results I see in Google for non-technical household related queries. Just incoherent rambling that sounds plausible but is complete nonsense.
> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward
Not all AI content is necessarily bad, nor all human content good. We need a way to separate the good from the bad, not AI from human, and it might be impossible to do 100% correctly anyway.
I think I would compare it to Stack Overflow. Some of the solutions do exist there, but not all are applicable to the use case or the exact circumstances the person asks there and yet the prompt used by AI would remain the same. SO has its rating system, but it has the same issue as the sentence above. From that perspective, we have identified potentially good human output ( assuming it wasn't already pollinated with AI output, which seems less and less likely ) that should only be accessible by humans and we would need a separate forum for bad AI output ( that should be verified by humans as bad but maybe only be accessible by AI once verified ).
I am just spitballing. I do not really have a solution in mind. It just sounds like an interesting problem going forward.
> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits?
No, because models are already trained like that. Datasets for large models are too vast for a human to even know what's inside, let alone label them manually. So instead they are processed (labeled, cropped, etc) by other models, with humans overseeing and tweaking the process. Often it's a chain with several models training each other, bootstrapped from whatever manual data you have, and curated by humans in key points of the process.
So it's actually the opposite - the hybrid bootstrapping approach that combines human curation and ML labeling of bulk low-quality data typically delivers far better results than training on a small but 100% manual dataset.
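A minimal sketch of that bootstrapping shape, with the models and the human review step passed in as callables since none of this maps to a specific real library:

    import random

    def bootstrap_labels(seed_data, unlabeled, base_model, train, human_review,
                         review_fraction=0.02):
        model = train(base_model, seed_data)          # bootstrap from the manual seed set
        auto = [(x, model(x)) for x in unlabeled]     # cheap machine labels for the bulk
        sample = random.sample(auto, int(len(auto) * review_fraction))
        corrections = human_review(sample)            # humans curate only key points
        return train(model, seed_data + auto + corrections)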
> They are processed by other models, humans overseeing and tweaking the process. Often it's a chain with several models training each other, bootstrapped from whatever manual data you have, and curated by humans in key points of the process.
A great description of what actually happens when you deal with massive datasets. One way to inspect a large dataset is to cluster it, and then look at just a few samples from each cluster, to get an overview.
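Something like this is all it takes to get that kind of overview, assuming you already have an embedding vector per item from some encoder (`items` and `embeddings` are placeholders):

    import numpy as np
    from sklearn.cluster import KMeans

    def overview(items, embeddings, n_clusters=20, per_cluster=3, seed=0):
        labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
        rng = np.random.default_rng(seed)
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            picks = rng.choice(idx, size=min(per_cluster, len(idx)), replace=False)
            for i in picks:
                print(c, items[i])                    # skim a handful per cluster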
Okay, for image models I think humans could help a lot more than we give them credit for. We can read and parse images WAY faster than you might think.
What if we just crowdsource and have a new Folding@home protein thing but this time it’s for classifying data sets? LAION-5B has 5 billion image text pairs; if we got 10,000 people together that’d be… 500,000 per person, which would take… a while but not forever. Humans can notice discrepancies super quickly. Like a slideshow: display the image and the text pair at a speed set by the user, and pause and tweak ones that are outright wrong or shitty.
Boom, refined image set.
Maybe? I’m looking at the LAION-5B example sets on their website and it seems to literally be this simple. A lot of the images seemed pretty poorly tagged. You get a gigantic manually tagged data set, at least for image classification.
I assume at some point, ChatGPT needs some kind of text ranking. Popular texts are usually correct (content and presentation) and useful, so they should rank higher. At some point, low-quality texts are filtered out. Personally, I don't care if a text is written by a human or a machine as long as it's good.
> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.
I just had my eyes opened reading that, because humans also do exactly that, inadvertently.
This isn't an issue, because it's possible to add prose quality and content accuracy scores to training data and train the model to predict those quantities during generation, which would allow you to condition the generation on high prose quality/accuracy. It just requires a small update to the model, and a shit ton of data set annotation time.
Likewise, images can be scored for aesthetics and consistency and models updated to predict and condition in the same way.
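One concrete way to do that conditioning is control tokens: tag every training example with its human-assigned scores so the model learns the association, then prompt with the "good" tags at generation time. The tag format and the example scores below are invented for illustration:

    def add_quality_tags(examples):
        # examples: list of (text, prose_score, accuracy_score), scores in 1..5
        return [f"<prose={p}> <accuracy={a}> {text}" for text, p, a in examples]

    train_rows = add_quality_tags([
        ("The mitochondria is the powerhouse of the cell.", 3, 5),
        ("mitochondria power cell is of the the house", 1, 1),
    ])
    prompt_prefix = "<prose=5> <accuracy=5> "  # condition generation on high quality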
How would you score them at scale without training some model to differentiate real vs. AI content? If you need to train such a model, where would you get the data from?
There will likely be selective pressure from human interaction with the data to curate good content above bad.
After all, we had the issue of millions of auto-generated bad pages in the web 1.0 SEO days. Search engines addressed it by figuring out how to rely more heavily on human behavior signals as an indication of value of data.
The thing that concerns me is that we may end up with a downward trend in accuracy.
If AI writes the bulk of the content, how long will it be before people simply do not put in the work to make sure things are true or put in the work to discover and understand new true things?
>Such things could end up creating a downwards trend in quality, as more and more of such junk gets posted, absorbed into the model and amplified further.
I feel like a similar thing already happened with YouTube recommendations
To your point, anecdotally, the system is heavily gamed. The other day, I saw reviews pop up for a restaurant that had not even opened yet. Either reviewers got a sneak peek behind the chef's curtain or those reviews are not quite true.
Sadly, word of mouth again becomes, de facto, the only semi-reliable way to separate crap from non-crap, and even that comes with its own set of issues and attempts at gaming.
There is so much stuffing for a simple idea that I'm not sure if this piece deserves its own title, but I'll give it the benefit of the doubt.
One thing that I wonder though is how we will draw the line. If I'm writing a piece and do a Google search, and in that way invoke BERT under the hood, is anything that I write afterwards "AI-tainted"? What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?
And that brings the other problem: do the general public really know the extent of AI use today, never mind in the future?
With all of that out of the way, yes, I would rather read text produced by human beings, not because of its quality--the AI knows, sometimes humans can't help themselves and just keep writing the same thing over and over, especially when it comes to fiction--but just to defend human dominance.
>What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?
We could get this whole discussion back to some semblance of sanity if we stopped calling any form of remotely complicated automation "AI". The term might as well be meaningless now.
Nothing about any of these "AIs" is intelligent in the sense of the layman's understanding of artificial intelligence, let alone the intelligence of biological and philosophical schools of thought.
> There is so much stuffing for a simple idea that I'm not sure if this piece deserves its own title, but I'll give it the benefit of the doubt.
Frankly I had the same thought writing it :D
It's more of a stake in the ground sort of a thing I guess?
What I really want is somebody saying "hey, there is an open standard already here" so I can use it.
The idea has some legs, but they are weak for the many reasons pointed out to me by fair criticism of "digital veganism". The main one is that labelling is one small part of quality. Tijmen Schep in his 2016 "Design My Privacy" [1] proposed some really cool ideas around quality and trustworthiness labelling of IoT/mobile devices, but ran into the same issues. Responsibility ultimately lies with the consumer, and so long as consumers remain uneducated as to why low quality is harmful, and cannot verify the provenance of what they consume or the harmful effects, nothing will change.
Right now we seem to be at the stage of "It's just McDonald's/KFC for data - junk food is convenient, cheap and not a problem - therefore mass production generative content won't be a problem".
The food analogy is powerful, but has limits, and I urge you to dig into Digital Vegan [2] if you want to take it further.
[1] https://www.tijmenschep.com/design-my-privacy/
[2] https://digitalvegan.net
>One thing that I wonder though is how we will draw the line. If I'm writing a piece and do a Google search, and in that way invoke BERT under the hood, is anything that I write afterwards "AI-tainted"? What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?
>And that brings the other problem: do the general public really know the extent of AI use today, never mind in the future?
The line is drawn at human ownership/responsibility. A piece of content can be 'AI tainted' or '100% produced by AI'; what makes the difference is whether a human takes responsibility for the end product or not.
Responsibility and ownership always lies with the humans. Even supposedly 100% AI generated content is still coming from a process started and maintained by humans. Currently also prompted by a human.
The humans running those processes can attempt to deny ownership or responsibility if they so choose but whenever it matters such as in law or any other arena dealing with liability or ownership rights, the humans will be made to own the responsibility.
Same as for self-driving cars. We can debate about who the operator is and to what extent the manufacturers, the occupants, or the owners are responsible for whether the car causes harm but we'll never try to punish the car while calling all humans involved blameless. The point of holding people responsible for outcomes and actions is to drive meaningful change in human behaviors in order to reduce harms and encourage human flourishing.
In terms of ownership and intellectual property, again the point of even having rules is to manage interactions between humans so we can behave civilly towards each other. There can be no meaningful category of content produced "100%" by AI unless AI become persons under the law or as considered by most humans.
If an AI system can ever truly produce content on its own volition, without any human action taken to make that specific thing happen, then that system would be a rational actor on par with other persons and we'll probably begin the debate over whether AI systems should be treated as people in society and under the law. That may even be a new category distinct from human persons such as it is with the concept of corporate persons.
> ... yes, I would rather read text produced by human beings, not because of its quality ... (snip) ... but just to defend human dominance.
One could make a strong argument that defending moral principles is preferable to preferring the underlying creative force to have a particular biological composition.
As an example, I don't want a system to incentivize humans kept as almost-slaves to retype AI generated content.
How can one tell the difference between all the gradations of "completely" human generated to not?
> One of my favorite products is “100% Fat-Free Pickled Cucumbers Fit (Gluten Free), “ which I once saw at the grocery store.
On my first flight to the US, in the 90's, a rather obese lady in the row in front of me asked the flight attendant: "Excuse me. Do you have fat-free water?"
The flight attendant hesitated a split second, her face not moving an inch. Then she smiled and replied: "We certainly have fat-free water, madam. I'll fetch you a bottle straight away."
A few years ago in a hotel in London I had complimentary water bottles on the night stand. The label said "Organic Water from Scotland". I was like: uhh, organic water from Scotland is probably cattle piss. I prefer inorganic, non-bio water.
The long term impact of the ease of generating low nutrition digital content using language models may be that people put down their devices and return to the real world. We’re already far down that path with the existing internet where most content is generated for SEO.
Anything you’re consuming on the internet or even on a TV may just be random noise generated by some model so why waste your precious time consuming it?
On the flip side why waste your time producing content if it’s going to be drowned in a sea of garbage mashed together by some language model?
> The long term impact of the ease of generating low nutrition digital content using language models may be that people put down their devices
The problem is people don't always make the wise decision. Evidence: the junk food industry is alive and kicking.
Some people will disconnect from devices, but others may just say "this is the way things are now" and adjust themselves to the flavor of junk content.
Why are you assuming that it will make writing worse, not better?
Just because it can be used by non-experts to create crappy written work doesn't mean it can't also be used by people who work with it to augment and improve their existing written work.
Sorry, but all I can think of after reading this blog post is the evil bit RFC: https://www.ietf.org/rfc/rfc3514.txt which has had just as much effect on internet security as this proposal will have on controlling ai generated content.
This post reminds me of Samuel Butler's novel, Erewhon.
"Assume for the sake of argument that conscious beings have existed for some twenty million years: see what strides machines have made in the last thousand! May not the world last twenty million years longer? If so, what will they not in the end become? Is it not safer to nip the mischief in the bud and to forbid them further progress?"
That reminds me of the epilogue to H.G. Wells' The Time Machine:
“ He, I know—for the question had been discussed among us long before the Time Machine was made—thought but cheerlessly of the Advancement of Mankind, and saw in the growing pile of civilization only a foolish heaping that must inevitably fall back upon and destroy its makers in the end. If that is so, it remains for us to live as though it were not so.”
That passage has haunted me since. I often wonder if that is the answer to the Fermi paradox. Civilization might be but a brief spark in the long night, separated from others by both time and distance insurmountable.
1. Before 1970
2. By someone who has consumed AI-generated content
3. With a ~2000-era spellchecker
4. By someone using ~2020-era neural speech-to-text software
5. With a ~2020-era spellchecker
6. By someone with an AI in the traditional editor role (reads it, gives detailed feedback)
7. By a human and an AI working together where the AI generates everything initially but the human fixes things and nothing goes out without human endorsement.
I'd probably draw the line at 7, but you could also argue for 6 or even 5.
I've always maintained that for any food product labeled "Home Style" or "Home-Made Flavor", the product must also feature a photograph of the factory floor where the product is made.
I was thinking about this exact problem a few days ago when I created a site hosting poems that were either 100% AI written, or 100% Human.
https://news.ycombinator.com/item?id=34472478
Then I asked people to guess the authorship. Amazingly, people guess correctly only about 70% of the time. https://random-poem.com/
I'm guessing it will get even harder to tell as the AI improves further down the road.
https://random-poem.com/Weird
"ridicolously"
Good Question. However this is 100% ChatGPT. Perhaps it knows that Weird Al intentionally misspells words in his work, hence it has intentionally introduced this typo. Which makes this type of AI even more awesome.
--edit
.. weird:
I asked ChatGPT if there is a typo in this poem. This is what it responded with:
Me: Does this poem have a typo? https://random-poem.com/weird
ChatGPT: It appears that there is an intentional typo in the first line of the poem "Weird Al, oh Weird Yankovic Al" instead of "Weird Al Yankovic". Yankovic being the surname of the artist, this addition can be seen as a playful and humorous way to refer to the artist, and give the poem a personal touch.
> Love the idea of having AI-Free declarations of content as it could / should help to differentiate organic content from generated content. It would be very interesting if companies and site owners wished to self-certify their site as organic with something like an /ai-free.txt.
And what are the consequences for lying?
I'm sure you can appreciate that such an initiative would be wholesale abused from day 1.
> Yes! I remember generating a seemingly reasonable R script except that the library that it called to do most of the work didn't exist! It was like code from an alternate dimension!
Maybe it was? Haha. ChatGPT has seen some things and knows something we don’t!
> ChatGPT and DALL-E are pretty cool but I think until AI gets its own intent and goals it's pretty fair to try to separate human art and AI art.
Will be a great question when we can't tell the difference.
> I feel like a similar thing already happened with YouTube recommendations
Have you googled for reviews on toaster ovens recently?
> On my first flight to the US, in the 90's, a rather obese lady in the row in front of me asked the flight attendant: "Excuse me. Do you have fat-free water?"
I suppose you wrote the 90's lest our mental image be of a 2020's rather obese lady? The woman would probably be considered skinny today.
> On the flip side why waste your time producing content if it's going to be drowned in a sea of garbage mashed together by some language model?
To my mind AI is a general purpose technology: https://en.wikipedia.org/wiki/General-purpose_technology
I guess using this mental model, what you are worried about is the equivalent of pollution?
Did the printing press also increase the amount of crap in circulation?
"Assume for the sake of argument that conscious beings have existed for some twenty million years: see what strides machines have made in the last thousand! May not the world last twenty million years longer? If so, what will they not in the end become? Is it not safer to nip the mischief in the bud and to forbid them further progress?"
“ He, I know—for the question had been discussed among us long before the Time Machine was made—thought but cheerlessly of the Advancement of Mankind, and saw in the growing pile of civilization only a foolish heaping that must inevitably fall back upon and destroy its makers in the end. If that is so, it remains for us to live as though it were not so.”
That passage has haunted me since. I often wonder if that is the answer to the Fermi paradox. Civilization might be but a brief spark in the long night, separated from others by both time and distance insurmountable.
kicks big pile of old books
“it remains for us to live as though it were not so” is a wonderful line.
The history of the term "handmade" and discussions about what to allow on Etsy come to mind: https://whileshenaps.com/2013/10/etsy-redefines-handmade-aut...