Readit News
yowzadave · 2 years ago
Musk has been fixated on this idea that Twitter is a huge treasure trove of data for AI training, often complaining about AI companies using its data for training purposes. I had just assumed that companies like OpenAI were just crawling the web, which included Twitter, rather than targeting Twitter in particular. Is the Twitter data really that valuable for training AIs? What particular qualities does it have that make it particularly useful compared to any other freely available data set?
adtac · 2 years ago
(1) Twitter's data is accurately timestamped, and (2) new data about recent events is constantly flowing in. There's no other English-language source like that, apart from Reddit.

AFAIU neither of those is relevant to GPT-like architectures, but it's not inconceivable that a future model architecture could take advantage of them. Purely from an information-theoretic POV, there are non-zero bits of information in the timestamp and relative ordering of tweets.
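The "non-zero bits" claim can be made concrete with Shannon entropy. A minimal sketch (the hour-of-day buckets below are entirely hypothetical, just to show the arithmetic):

```python
import math
from collections import Counter

# Hypothetical tweet timestamps, already bucketed by hour of day.
# The entropy of this distribution is the number of bits of information
# the timestamp field carries, on average, per tweet.
timestamp_hours = [0, 0, 1, 2, 2, 2, 3, 3]

counts = Counter(timestamp_hours)
total = len(timestamp_hours)
entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(entropy_bits, 3))  # ~1.906 bits per tweet for this toy distribution
```

A uniform distribution over the four buckets would give the maximum of 2 bits; any skew (and real tweet timing is heavily skewed) gives less, but still more than zero.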

temporalparts · 2 years ago
> There's no other source like that in English other than Reddit

1) Facebook Posts/Comments, 2) Instagram Posts/Comments, 3) Youtube Comments, 4) Gmail content, 5) LinkedIn Comments, 6) TikTok contents / comments

X and Reddit are definitely valuable, but they're definitely not unique. I think Meta and Google have inherent advantages because their data is not accessible to LLM competitors and they have the actual capabilities to build great LLMs.

Unless X decides to tap AI talent in China, they're going to have a REALLY hard time spinning up a competitive LLM team compared to OpenAI, Google, and Meta, which I think are the top three LLM companies in that order.

peter422 · 2 years ago
Just take an aggregate of every news article written in the top 100 newspapers.

That's high quality content, timestamped and about current events.

There is very little content on Twitter that compares in quality to one well-written news article.

matchagaucho · 2 years ago
It could also be argued that the Twitter firehose requires substantial RLHF, de-biasing and moderation controls because of its colloquial nature.
mejutoco · 2 years ago
Wire news services come to mind as an alternative.
berkes · 2 years ago
> there's new data constantly flowing in talking about recent events.

In which the distinction between "data" and "information" is crucial. Especially now that the "floodgates" have been re-opened regarding misinformation, bots, impersonators and the likes.

Data is crucial when in need of training body. But information is crucial when the training must be tuned, limited or just verified.

holler · 2 years ago
What about HN?
summerlight · 2 years ago
I think Twitter is not a great place to dig out training data in general. Most of its data is not well structured and/or tagged. Its signal-to-noise ratio is relatively low. Its texts are generally very short and very dependent on context. Twitter has been largely failing as a short-form video platform. There's some trace of algorithmic generation of interest/topic-based feeds on Twitter, but you know, their quality was never great. I guess it's just a hard problem given Twitter's environment.

Its strength is freshness and volume, but I guess these can be achieved without Twitter if you have a strong web crawling infrastructure? Also, the current generation of LLM is not really capable of exploiting minute-level freshness... at least for now.

dheera · 2 years ago
Also, Twitter is not where people go to be nice. Twitter incentivizes snarky, disparaging, curt behavior because (a) it limits message length to an extent where nice speech doesn't have a place (b) saying nice things gets you likes while saying not-nice things gets you retweets, and retweets are more highly valued by the algorithm.
nonfamous · 2 years ago
Twitter’s data would be very valuable for generating tweet-like content: short, self-contained snippets, images and video.

There’s not a lot of data in Twitter today resembling long-form content: essays, news articles, books, scientific papers, etc. That’s probably why Twitter/X expanded the tweet size limit, to be able to collect such data.

jahewson · 2 years ago
Yep! It would be great at generating hot takes and clickbait.
cpeterso · 2 years ago
If Twitter were as much of a treasure trove of user data as Elon thinks it is, then why is Twitter's ad targeting so much worse than Facebook's and Instagram's?
elondaits · 2 years ago
Twitter has the largest database of tweets. If you want an AI that writes tweets, there's nothing better. Why? Twitter could offer a service that bypasses the need for a community manager… just feed it a press release, or a product website, and it will produce a stream of well-crafted ad tweets… or astroturf, even.
throwaway4aday · 2 years ago
Doesn't really matter what the content is as long as the sequences of tokens make sense. That's the goal: predict the next token given the previous N tokens. Higher level structures like tweets just fall out of that but I wouldn't be too surprised if a model trained only on tweets could also generalize to some other structures.
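The "predict the next token given the previous N tokens" objective can be sketched with a toy bigram model (N=1). This is a deliberate simplification: real LLMs use subword tokens and neural networks, and the two-line corpus here is made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus of short, tweet-like lines.
corpus = [
    "the model predicts the next token",
    "the model learns from the corpus",
]

# Count, for each token, which tokens follow it and how often.
counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

def predict(prev_token):
    """Return the most frequent next token after prev_token, or None if unseen."""
    if prev_token not in counts:
        return None
    return counts[prev_token].most_common(1)[0][0]

print(predict("the"))  # "model" follows "the" most often in this corpus
```

The model has no notion of "tweet" at all; structures like that only emerge statistically from the sequences it was trained on, which is the commenter's point.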

emodendroket · 2 years ago
Twitter and Reddit are extremely valuable to LLMs and makers of both are really kicking themselves over missing the boat with open APIs.
lolc · 2 years ago
Now they're kicking themselves into irrelevance by restricting access.
mistymountains · 2 years ago
Comment datasets are valuable for conversational AI, it’s the same reason Reddit locked down the API I imagine.
smrtinsert · 2 years ago
Training anything on a network of bot traffic. What a time to be alive...
ilamont · 2 years ago
It's not a change confined to Twitter; it's a glimpse into what could be the new normal for the internet.

You can bet Google/Gmail/YouTube, Amazon, Microsoft, TikTok, and every other Internet platform that works with user-generated content will soon do the same ... if they haven't done so already.

dayvid · 2 years ago
GMail's already been using your data for classification, data mining, and other AI purposes. I would be legitimately surprised if they're NOT using processed GMail data (potentially with sensitive data removed, etc.) in training their LLMs or other AI projects.
crazygringo · 2 years ago
Yup. To be clear, that's with the free version though.

If you pay, corporate versions of Google Workspace won't train on your data. That's very much by design, since companies don't want anything internal ever being exposed.

But with the free version, that's part of how you "pay" for it to remain free.

lern_too_spel · 2 years ago
They'd certainly use it to train their spam classifiers, but using that data in a generative model shared with other users would risk information leaks, so they wouldn't do that.
frob · 2 years ago
They've been doing it for years. Everything you put on their servers is a source for them to train on. At FB Messenger in the mid-to-late 2010s, the suggestion and auto-reply models were trained on the entire corpus of unencrypted messages sent between users.
systems_glitch · 2 years ago
One of the big reasons why I went with Proton Mail for whitelabeled email service when I stopped hosting my own. Nothing to scrape on the server side.
manuelabeledo · 2 years ago
The absolute irony of Google doing this.

lynndotpy · 2 years ago
The terms now also have a class-action waiver clause
waveBidder · 2 years ago
how the fuck are those legal? frankly more disturbing to me than the AI training one, which I find less objectionable than selling people's attention to advertisers
AnthonyMouse · 2 years ago
>how the fuck are those legal?

The US doesn't have loser pays and has some of the most expensive litigation in the world, which has created all kinds of problems. Someone can file a lawsuit against you knowing that they're unlikely to win, but in so doing they could cost you hundreds of thousands of dollars for lawyers, so why don't you just go ahead and settle for tens of thousands of dollars? It will cost you less to settle than to win in court.

This flaw was made to scale by class action lawsuits, which more than any other should be loser pays, because there is little question that thousands of people who have each been harmed to the tune of $100 could each front $10 for a meritorious lawsuit. But instead you get opportunistic lawyers signing up anyone they can find for questionable claims, so they can reach a settlement where the plaintiffs each get $7 -- or a $7 gift certificate -- and the lawyers get millions.

This was rightly regarded as a problem but the lawyers had enough political power to prevent a good solution, so what we got instead was to make it easier to force binding arbitration and opt out of class action suits.

Lawyers ruin everything. They even ruin lawyers.

danans · 2 years ago
IANAL. According to [1] they are legal per Supreme Court precedent, but it also says:

"Contract formation is increasingly scrutinised. Following Concepcion and its progeny, some courts have focused on issues of contract formation to determine whether the consumer in fact agreed to arbitration and the class action waiver. This inquiry is largely confined to online transactions, where a consumer is deemed to have consented to arbitration by using the business's website to purchase goods or services. These contracts fall within the rubric of "clickwrap," "browsewrap," or "webwrap" agreements and their enforceability is beyond the scope of this article. However, it is important to note that the courts will refuse to enforce class action waivers and arbitration agreements in such agreements when the arbitration provisions were insufficiently conspicuous to ensure the consumer objectively agreed to their terms."

The article also mentions that non-negotiable consumer contracts are viewed with more suspicion by some courts.

1. https://content.next.westlaw.com/practical-law/document/I9f1....

nostromo · 2 years ago
Class action lawsuits suck for victims. The only people it’s good for are lawyers, who become fabulously wealthy, while the victims get a check for two dollars in the mail.

Individual lawsuits are better for everyone.

danans · 2 years ago
> AI training one, which I find less objectionable than selling people's attention to advertisers

How do you know that the AI won't be used to sell people's attention to advertisers?

cjohnson318 · 2 years ago
Unfortunately, "legal" means whatever certain specific people agree that it means.
LadyCailin · 2 years ago
That backfired on them when all the Twitter employees they fired sued them in individual lawsuits, so now they have to repeat the discovery process and everything else for each case. Lol. That only backfired because enough individuals had the means to bring a full suit at the same time, though. Unlikely to happen normally.
dboreham · 2 years ago
Oh, I see. For a few seconds I was wondering how a windowing system uses data for AI...
butz · 2 years ago
Other big companies should also shorten names of their premiere services, just to add to confusion.
AnthonyMouse · 2 years ago
If anyone is wondering, Netflix is now x.net, Amazon is a.com, Apple is a.us, Android is a.tv and Azure is a.mov.

Facebook will be known as y.us.

Robdel12 · 2 years ago
I can’t imagine anything worse for the world than an AI trained on whatever Twitter users are tweeting.
laylomo2 · 2 years ago
Can't wait to add all the thought leader hot takes bots to my ban lists.

aero-glide2 · 2 years ago
This was done because Elon's other company xAI needs this data.

londons_explore · 2 years ago
Nearly all terms allow them to use data to improve the product...

"Replacing some shoddy heuristics with a massive AI model" seems like product improvement to me.

Therefore, they were already allowed to train AI models with your data.

dagaci · 2 years ago
These terms are probably needed to cover public usage scenarios where someone could claim ownership or privacy violations even over a public forum.
zamalek · 2 years ago
So that they can train a chatbot? I thought Elon set out to eliminate bots.
candiddevmike · 2 years ago
Use it to heaven ban folks: https://cosmosmagazine.com/technology/internet/heaven-bannin...

CEO can use it on Elon to keep him placated

HumblyTossed · 2 years ago
Did anyone actually believe him?
serf · 2 years ago
Elon has the superpower that makes it so that regardless of the question asked/answered, the audience is 40-40 "always believe" and "never believe", the rest being skeptics or 'don't cares'.

This often works to his advantage; the "always believe" crowd is loud.

lawn · 2 years ago
Surprisingly, a lot of people did (and some still do).

How many times has Trump been caught lying, or doing a 180 from one day to the next? And there are still tons of people believing him.

taneq · 2 years ago
Oh, that guy with the self-driving cars and the self-landing rockets and the creepy humanoid robots that can sorta walk? Yeah totally, he hates automated stuff.
lowercased · 2 years ago
other peoples' bots.
djbusby · 2 years ago
You've become the very thing you swore to destroy!
numpad0 · 2 years ago
That guy already smashed into the wall trying to get into the Internet, and isn't very relevant anymore. I mean, there's inertia, sure.
nailer · 2 years ago
To eliminate spam bots. He didn't set out to eliminate AI.

I'm honestly surprised that anyone on HN wouldn't understand this.

thefurdrake · 2 years ago
Ah yes, the platform formerly known as twitter shall surely NEVER allow the weights produced from analysis of its conversational data to go into the production of spambots!

After all, there's definitely no market for a product that has been finely-trained on social media posts to the point where it can perfectly ape social media posts. There's no money for Twitter to make there, so the technology it produces from this analysis shall certainly never find its way into spam bots that create content indistinguishable from genuine humans on Twitter.

ToValueFunfetti · 2 years ago
It really is baffling. Removing bots from Twitter is totally orthogonal to training a bot on Twitter data; arguably the former is a prerequisite for the latter. And we've known that Musk wants to train a competing chatbot for at least 6 months now. Has he ever said anything to indicate that he wants to eliminate bots entirely?

Maybe I'm overthinking it and this is just a rhetorical gotcha.

comeonbro · 2 years ago
Yeah, that this willfully stupid a comment is so high up on so important a topic is pretty damning of the value of the conversation here.