Certified 100% AI-free organic content

> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward

I find this to be a particularly interesting problem in this whole debacle.

Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.

I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except for that the actual API used didn't really exist.

Now imagine a myriad fake blogs using ChatGPT under the hood to generate blog entries explaining how to solve often wanted problems, and that then being spidered and fed into ChatGPT 2.0. Such things could end up creating a downwards trend in quality, as more and more of such junk gets posted, absorbed into the model and amplified further.

I think image generation should be less vulnerable to this since all images need tagging to be useful, "ai generated" is a common tag that can be used to exclude reingesting old outputs, and also because with artwork precision doesn't matter so much. If people like the results, then it doesn't matter that much that something isn't drawn realistically.

eloisius · 3 years ago

> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.

This is exactly what I don't like about Copilot, maybe even more than the IP ethics of it. If it really succeeds, it's going to have a feedback loop that amplifies its own code suggestions. The same boilerplate-ish kind of code that developers generate over and over will get calcified, even if it's suboptimal or buggy. If we really need robots to help write our code, to me that means we don't yet have expressive enough languages.

sillysaurusx · 3 years ago

Devs can use their noggins to check whether copilot’s output is decent or not. It’s not a given that people will use it automatically.

smrtinsert · 3 years ago

Not really the case. I reject maybe 90٪ of copilot suggestions. If they leverage that I'm sure it will improve

LesZedCB · 3 years ago

but enterprise FizzBuzz is a demonstration that exactly that phenomenon will happen without AI, merely with books, youtube videos, or blogs (the calcification) and cargo culting (the lazy application).

rgrieselhuber · 3 years ago

As someone in SEO, I've been pretty disgusted by the desire for site owners to want to use AI-generated content. There are various opinions on this, of course, but I got into SEO out of interest in the "organic web" vs. everything being driven by ads.

Love the idea of having AI-Free declarations of content as it could / should help to differentiate organic content from generated content. It would be very interesting if companies and site owners wished to self-certify their site as organic with something like an /ai-free.txt.

capybara_2020 · 3 years ago

Curious, isn't SEO the thing that ruined search to a big extent. A 1000 word article where the actual answer needs a fraction of that size. Or interesting content buried because it is not "SEO optimized". Or companies writing blog content and making it look helpful while actually shilling and highlighting their own product. Plus tons of other things.

So now you need something like ChatGPT to cut through the noise?

dale_glass · 3 years ago

I don't see the point. There's lots of old content out there that won't get tagged, so lacking the tag doesn't mean it's AI generated. Meanwhile people abusing AI for profit (eg, generating AI driven blogs to stick ads on them) wouldn't want to tag their sites in a way that might get them ignored.

And what are the consequences for lying?

dazc · 3 years ago

> 'It would be very interesting if companies and site owners wished to self-certify their site as organic with something like an /ai-free.txt.'

I'm sure you can appreciate that such an initiative would be wholesale abused from day 1.

password11 · 3 years ago

>but I got into SEO out of interest in the "organic web" vs. everything being driven by ads.

If the owner of an SEO site wants to use AI for "content generation", doesn't that mean they didn't care about the human-generated content in the first place?

Seems like a choice between garbage and slightly more expensive garbage. What is interesting or organic about that? Back in the day, people used to put things on their websites because they cared about it and wanted to say it.

kurthr · 3 years ago

Don't worry, it won't be small shops doing it. It will be the majors, if that's where the money is.

To quote Yan LeCun:

   Meta will be able to help small businesses promote themselves by automatically producing media that promote a brand, he offered. 

   "There's something like 12 million shops that advertise on Facebook, and most of them are mom and pop shops, and they just don't have the resources to design a new, nicely designed ad," observed LeCun. "So for them, generative art could help a lot."

https://www.zdnet.com/article/chatgpt-is-not-particularly-in...

wizofaus · 3 years ago

How could it possibly help unless there were some independent verification mechanism though? If there's a motivation to lie about the content being "organically generated" because that's what search users prefer to find, then clearly people will. And it's hard to imagine what that verification process would look like given current technology.

gingerlime · 3 years ago

What about AI-assisted writing? e.g. improving style, grammar, readability, making explanations clearer and better structured? especially for non-native writers this is a challenge and not many can hire an editor or even a proofreader. I wonder if such use gets “penalized” by search engines the same way AI-generated content might?

megous · 3 years ago

Sure, let's just ignore the benefits of AI and pretend like it doesn't exist. That sounds like a great plan.

alpos · 3 years ago

> I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except for that the actual API used didn't really exist.

So the next question has to be: Was this still the right answer?

I've personally had plenty of instances in my programming career where the code I was working on really needed functions which were best shopped out to a common API. To avoid interrupting my flow and to better inform the API I'd be designing for this, I just continued to write as if the API did exist. Then I went on to implement the functions that were needed.

Perhaps the bot was right to presume that there should be an API for this. You might even be able to then prompt ChatGPT to create each of the functions in that API.

dangom · 3 years ago

Exactly, that there is an end to the rabbit hole is a limitation of today's models. If something does not exist, it should be generated on the spot. GPT5 should check for the existence of an API and if it exists, test and validate it. If it fails tests or doesn't exist, create it.

lolinder · 3 years ago

Well, this is ChatGPT, not Copilot, so I'd assume that OP was looking for a snippet using a public library rather than an internal API. In that context, suggesting you use an API that doesn't exist is just wrong.

I've definitely done this with Copilot, though—it will suggest an API that doesn't actually exist but logically should in order to be consistent, and I'll go create it.

jhbadger · 3 years ago

>I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except for that the actual API used didn't really exist.

Yes! I remember generating a seemingly reasonable R script except that the library that it called to do most of the work didn't exist! It was like code from an alternate dimension!

dagw · 3 years ago

I asked if there where any Open Source libraries that implemented a certain algorithm. It gave me links to 3 different GitHub repos, none of which existed.

gattilorenz · 3 years ago

It’s as if ChatGPT was behaving like a language model with no real connection or understanding of R… hmmmmm…

Kelteseth · 3 years ago

Can confirm that this happened to me when I asked ChatGPT to generate a parallax effect in Qt/QML. It simply used a QML Elemened with the name Parallax.

VeninVidiaVicii · 3 years ago

Yeah, a few times when I ask for a reference to something outlandish, it generates a perfectly realistic looking paper alongside a doi link, that’s completely made up. Both the paper and the link link do not exist!

PartiallyTyped · 3 years ago

I have had the same experience with Typescript and Python.

bamboozled · 3 years ago

I asked it to generate sample aws step function config and as far as I could tell it made up configuration parameters. I know it’s a language model.

jay-barronville · 3 years ago

> It was like code from an alternate dimension!

Maybe it was? Haha. ChatGPT has seen some things and knows something we don’t!

luxcem · 3 years ago

> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward

Every discussion on AI take the example of ChatGPT and its inherent flaws but AI-generated content doesn't have to be dull and low quality.

One question that bother me is does it really matter? If AI-generated content is on par with Human-made or even better does it matter anymore that an AI generated it?

Maybe it's the sentimental value, empathy, fidelity?

If an AI had written Mozart's Requiem would it lessen its interest, its beauty?

DennisP · 3 years ago

I don't think AI has to be low-quality for GP's concern to be valid.

Humans get inputs from a large variety of sources, but if an AI's input is just text, then there's the potential for AI's input to mostly consist of its prior output. Iterate this, and its model could gradually diverge from the real world.

The equivalent in human society is groupthink, where members of a subgroup get most of their information from the same subgroup and end up believing weird things. We can counter that by purposely getting inputs from outside of our group. For a text-model AI, this means identifying text that wasn't produced by AI, as the article suggests.

lolinder · 3 years ago

> Every discussion on AI take the example of ChatGPT and its inherent flaws but AI-generated content doesn't have to be dull and low quality.

To get away from that we'd have to dramatically change our approach. The LLMs we have are trained on as much content as possible and essentially average out the style of their training data. What it writes reads like a B-grade high school essay because that is what you get when you average all the writing on the internet.

It's not obvious to me that a creative approach that boils down to "pick the most likely next word given the context so far" can avoid sounding bland.

iliane5 · 3 years ago

>One question that bother me is does it really matter? If AI-generated content is on par with Human-made or even better does it matter anymore that an AI generated it? > If an AI had written Mozart's Requiem would it lessen its interest, its beauty?

I think it's about intent. Art is interesting and beautiful to us because there is an undeniable human intent in creating it and a vision behind it.

ChatGPT and DALL-E are pretty cool but I think until AI get it's own intent and goals it's pretty fair to try to separate human art and AI art.

htrp · 3 years ago

> If an AI had written Mozart's Requiem would it lessen its interest, its beauty?

Will be a great question when we can't tell the difference.

layer8 · 3 years ago

It’s not clear when and how we would reach that level of quality. It doesn’t seem very relevant to the present state of affairs.

roughly · 3 years ago

Yes! There’s a common problem where people think that an ecosystem is infinite, or at least sufficiently large, when it’s not. We’ve done similar with dumping in the ocean, and now we’ve all got plastics in our blood, and we assumed soil quality was a given over time, too. AI content released into the wild will be consumed by AI; how can it not be? You’ve got a system which can produce content at a rate several orders of magnitude higher a human, of course the content ecosystem will be dominated by AI-generated content, so of course the quality of content generated by AI systems, which rely on non-AI-generated content to train, will go down over time.

r3trohack3r · 3 years ago

I feel like we are nearing peak "uncurated" content, both for humans and machines. Humans are still grappling with our novel abundance problems.

As we move forward, suspect we will see an increase in curation services and AI models will do more with less. You can bootstrap a productive adult human on an almost infintismal slice of the training sets we are using for the current gen of AI, can't imagine future approaches are going to need such an unbounded input to get better results - but might be wrong!

If content is curated for its quality, whether or not it's AI generated (or assisted) doesn't matter.

joshspankit · 3 years ago

We’d have to adjust capitalism to deal with the “novel abundance” problems. Most of the drive for novel content/audiences is simply to decide which people get a cut of the revenue (and/or audience).

If we focused on quality and stopped caring about who gets paid for what I suspect that not only would we have better quality overall but we’d also push the boundaries much faster thus making things even more interesting.

welshwelsh · 3 years ago

I think that in addition to influencing future models, AI content will also influence how humans think and write. People will start ironically and unironically copying GPT's style in their own writing, causing human produced content to increasing resemble AI content.

High school students that are prohibited from using AI for their essays will have a bad time. Even if they don't use AI chatbots themselves, they will unknowingly cite sources that were written by AI, or were written by someone who learned about the topic by asking ChatGPT.

messe · 3 years ago

Hmm, forgetting natural language for a moment and instead considering programming languages: it’s pretty easy to generate nonsense but semi-plausible looking ASTs without the help of AI. Could this be used to attack GitHub’s copilot?

Step 1. Release a tool that generates nonsense code across a thousand repositories, and allow anybody to publish crap to GitHub.

Step 2. Copilot trains on those nonsensical repositories because it can’t distinguish them from the real thing.

Step 3. Copilot begins to generate similar crap.

theRealMe · 3 years ago

Imagine this as a security attack vector. Instead of nonsense, spam a bunch of repos with code that does a specific thing but in a very hard to understand way. Then add in a small piece of very hard to understand, but legit looking malicious code. Copilot trains on it and then starts feeding it to developers around the world. Probably easier ways to achieve this, but interesting to think about.

empyrrhicist · 3 years ago

You'd probably also have to botspam stars, issues, pull requests.

kaetemi · 3 years ago

Those blogs already exist. Pretty much 90% of the results I see in Google for non-technical household related queries. Just incoherent rambling that sounds plausible but is complete nonsense.

visarga · 3 years ago

> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward

Not necessarily all AI contents are bad and all human contents are good. We need a way to separate the good from the bad, not AI from human, and it might be impossible to do 100% correct anyway.

A4ET8a8uTh0 · 3 years ago

I think I would compare it to Stack Overflow. Some of the solutions do exist there, but not all are applicable to the use case or the exact circumstances the person asks there and yet the prompt used by AI would remain the same. SO has its rating system, but it has the same issue as the sentence above. From that perspective, we have identified potentially good human output ( assuming it wasn't already pollinated with AI output, which seems less and less likely ) that should only be accessible by humans and we would need a separate forum for bad AI output ( that should be verified by humans as bad but maybe only be accessible by AI once verified ).

I am just spitballing. I do not really have a solution in mind. It just sounds like an interesting problem going forward.

orbital-decay · 3 years ago

> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits?

No, because models are already trained like that. Datasets for large models are too vast for a human to even know what's inside, let alone label them manually. So instead they are processed (labeled, cropped, etc) by other models, with humans overseeing and tweaking the process. Often it's a chain with several models training each other, bootstrapped from whatever manual data you have, and curated by humans in key points of the process.

So it's actually the opposite - the hybrid bootstrapping approach that combines human curation and ML labeling of bulk low-quality data typically delivers far better results than training on a small but 100% manual dataset.

visarga · 3 years ago

> They are processed by other models, humans overseeing and tweaking the process. Often it's a chain with several models training each other, bootstrapped from whatever manual data you have, and curated by humans in key points of the process.

A great description of what actually happens when you deal with massive datasets. One way to inspect a large dataset is to cluster it, and then look at just a few samples from each cluster, to get an overview.

wincy · 3 years ago

Okay for image models I think humans could help a lot more than we give them credit for. We can read and parse images WAY faster than you might think.

What if we just crowdsource and have a new Folding@home protein thing but this time it’s for classifying data sets? LAION-5B has 5 billion image text pairs, if we got 10,000 people together that’d just be… 100,000 per person which would take… awhile but not forever. Humans can notice discrepancies super quickly. Like a slide show display the image and the text pair at a speed set by the user, and pause and tweak ones that are outright wrong or shitty.

Boom, refined image set.

Maybe? I’m looking at the LAION-5B example sets on their website and it seems to literally be this simple. A lot of the images seemed pretty poorly tagged. You get a gigantic manually tagged data set, at least for image classification.

davidkunz · 3 years ago

I assume at some point, ChatGPT needs some kind of text ranking. Popular texts are usually correct (content and presentation) and useful, so they should rank higher. At some point, low-quality texts are filtered out. Personally, I don't care if a text is written by a human or a machine as long as it's good.

joshspankit · 3 years ago

Popular as defined by who? Because I think we all know that popular as defined by Pagerank, sales, or visits has a lot of issues.

layer8 · 3 years ago

Information quality and authenticity will become a cat-and-mouse game just like information security.

jjtheblunt · 3 years ago

> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.

I just had my eyes opened reading that, because humans also do exactly that, inadvertently.

jjtheblunt · 3 years ago

another afterthought : does that mean humans are artificial AI?

CuriouslyC · 3 years ago

This isn't an issue, because it's possible to add prose quality and content accuracy scores to training data and train the model to predict those quantities during generation, which would allow you to condition the generation on high prose quality/accuracy. It just requires a small update to the model, and a shit ton of data set annotation time.

Likewise, images can be scored for aesthetics and consistency and models updated to predict and condition in the same way.

rcme · 3 years ago

How would you score them at scale without training some model to differentiate real vs. AI content? If you need to train such a model, where would you get the data from?

joshspankit · 3 years ago

Then you’d just create the AI equivalent of black-hat SEO.

foobarbecue · 3 years ago

It reminds me of a dog eating its own vomit.

euroderf · 3 years ago

Excellent analogy. Filed for future use.

shadowgovt · 3 years ago

There will likely be selective pressure from human interaction with the data to curate good content above bad.

After all, we had the issue of millions of auto-generated bad pages in the web 1.0 SEO days. Search engines addressed it by figuring out how to rely more heavily on human behavior signals as an indication of value of data.

joshspankit · 3 years ago

The thing that concerns me is that we may end up with a downward trend in accuracy.

If AI writes the bulk of the content, how long will it be before people simply do not put in the work to make sure things are true or put in the work to discover and understand new true things?

bogdanoff_2 · 3 years ago

>Such things could end up creating a downwards trend in quality, as more and more of such junk gets posted, absorbed into the model and amplified further.

I feel like a similar thing already happened with YouTube recommendations

user3939382 · 3 years ago

I brought this exact issue up recently https://news.ycombinator.com/item?id=34252938

noduerme · 3 years ago

>> a downwards trend in quality

Have you googled for reviews on toaster ovens recently?

A4ET8a8uTh0 · 3 years ago

To your point, anecdotally, the system is heavily gamed. The other day, I saw reviews for a restaurant pop up that did not quite open yet. Either reviewers got a sneak peek behind the chef's curtain or those reviews are not quite true.

Sadly, word of mouth again becomes defacto the only semi-reliable way to separate crap from non-crap and even that comes with its own set of issues and attempts at gaming.

There is so much stuffing for a simple idea that I'm not sure if this piece deserves its own title, but I'll give it the benefit of the doubt.

One thing that I wonder though is how we will draw the line. If I'm writing a piece and do a Google search, and in that way invoke BERT under the hood, is anything that I write afterwards "AI-tainted"? What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?

And that brings the other problem: do the general public really know the extent of AI use today, never mind in the future?

With all of that out of the way, yes, I would rather read text produced by human beings, not because of its quality--the AI knows, sometimes humans can't help themselves and just keep writing the same thing over and over, specially when it comes to fiction--but just to defend human dominance.

Dalewyn · 3 years ago

>What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?

We could get this whole discussion back to some semblance of sanity if we stopped calling any form of remotely complicated automation "AI". The term might as well be meaningless now.

Nothing about any of all these "AIs" is intelligent in the sense of the layman's understanding of artificial intelligence, let alone intelligence of biological and philosophical schools of thought.

artpi · 3 years ago

> There is so much stuffing for a simple idea that I'm not sure if this piece deserves its own title, but I'll give it the benefit of the doubt.

Frankly I had the same thought writing it :D

It's more of a stake in the ground sort of a thing I guess? What I really want is somebody saying "hey, there is an open standard already here" so I can use it.

nonrandomstring · 3 years ago

The idea has some legs, but they are weak for the many reasons pointed out to me by fair criticism of "digital veganism". The main one is that labelling is one small part of quality. Tijmen Schep in his 2016 "Design My Privacy" [1] proposed some really cool ideas around quality and trustworthiness labelling of IoT/mobile devices, but ran into the same issues. Responsibility ultimately lies with the consumer, and so long as consumers remain uneducated as to why low quality is harmful, and cannot verify the provenance of what they consume or the harmful effects, nothing will change.

Right now we seem to be at the stage of "It's just McDonald's/KFC for data - junk food is convenient, cheap and not a problem - therefore mass production generative content won't be a problem".

The food analogy is powerful, but has limits, and I urge you to dig into Digital Vegan [2] if you want to take it further.

[1] https://www.tijmenschep.com/design-my-privacy/

[2] https://digitalvegan.net

felideon · 3 years ago

By contrast, I enjoyed the entire piece, read another one of your posts, and subscribed to your newsletter.

arketyp · 3 years ago

It's interesting that the article mentions Kosher rules as if abiding by them is trivial and as if the practice isn't ridden with gray areas.

sigriv · 3 years ago

>One thing that I wonder though is how we will draw the line. If I'm writing a piece and do a Google search, and in that way invoke BERT under the hood, is anything that I write afterwards "AI-tainted"? What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?

>And that brings the other problem: do the general public really know the extent of AI use today, never mind in the future?

The line is drawn at human ownership/responsibility. A piece of content can be 'AI tainted' or '100% produced by AI', what makes the difference is if a human takes the responsibility of the end product or not.

alpos · 3 years ago

Responsibility and ownership always lies with the humans. Even supposedly 100% AI generated content is still coming from a process started and maintained by humans. Currently also prompted by a human.

The humans running those processes can attempt to deny ownership or responsibility if they so choose but whenever it matters such as in law or any other arena dealing with liability or ownership rights, the humans will be made to own the responsibility.

Same as for self-driving cars. We can debate about who the operator is and to what extent the manufacturers, the occupants, or the owners are responsible for whether the car causes harm but we'll never try to punish the car while calling all humans involved blameless. The point of holding people responsible for outcomes and actions is to drive meaningful change in human behaviors in order to reduce harms and encourage human flourishing.

In terms of ownership and intellectual property, again the point of even having rules is to manage interactions between humans so we can behave civilly towards each other. There can be no meaningful category of content produced "100%" by AI unless AI become persons under the law or as considered by most humans.

If an AI system can ever truly produce content on its own volition, without any human action taken to make that specific thing happen, then that system would be a rational actor on par with other persons and we'll probably begin the debate over whether AI systems should be treated as people in society and under the law. That may even be a new category distinct from human persons such as it is with the concept of corporate persons.

xpe · 3 years ago

> ... yes, I would rather read text produced by human beings, not because of its quality ... (snip) ... but just to defend human dominance.

One could make a strong argument that defending moral principles is preferable to preferring the underlying creative force to have a particular biological composition.

As an example, I don't want a system to incentivize humans kept as almost-slaves to retype AI generated content.

How can one tell the difference between all the gradations of "completely" human generated to not?