Musk has been fixated on this idea that Twitter is a huge treasure trove of data for AI training, often complaining about AI companies using its data for training purposes. I had just assumed that companies like OpenAI were just crawling the web, which included Twitter, rather than targeting Twitter in particular. Is the Twitter data really that valuable for training AIs? What particular qualities does it have that make it particularly useful compared to any other freely available data set?
(1) Twitter's data is accurately timestamped, (2) there's new data constantly flowing in talking about recent events. There's no other source like that in English other than Reddit.
AFAIU neither of those is relevant to GPT-like architectures, but it's not inconceivable that some future model architecture could take advantage of them. Purely from an information-theoretic POV, there's a non-zero number of bits of information in the timestamps and relative ordering of tweets.
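To make the "non-zero bits" point concrete, the Shannon entropy of a timestamp field can be estimated from its empirical distribution. A rough sketch (the hour-of-day bucketing and the sample data here are made up purely for illustration):

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical: hour-of-day buckets for a handful of tweet timestamps.
hours = [9, 9, 12, 12, 18, 18, 18, 23]
print(f"{entropy_bits(hours):.2f} bits")
```

A perfectly uniform field carries the most bits; a constant one carries zero. Whether any of those bits help a language model is exactly the open question the comment raises.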
X and Reddit are definitely valuable, but they're definitely not unique. I think Meta and Google have inherent advantages because their data is not accessible to LLM competitors and they have the actual capabilities to build great LLMs.
Unless X decides to tap AI talent in China, they're going to have a REALLY hard time spinning up a competitive LLM team compared to OpenAI, Google, and Meta, which I think are the top three LLM companies in that order.
> there's new data constantly flowing in talking about recent events.
This is where the distinction between "data" and "information" is crucial, especially now that the "floodgates" have been re-opened regarding misinformation, bots, impersonators, and the like.
Data is crucial when you need a training corpus. But information is crucial when the training must be tuned, constrained, or verified.
I think Twitter is not a great place to dig out training data in general. Most of its data is not well structured and/or tagged. Its signal-to-noise ratio is relatively low. Its texts are generally very short and very dependent on context. Twitter has also largely failed as a short-form video platform. There's some trace of algorithmic generation of an interest/topic-based feed on Twitter, but you know, its quality was never great. I guess it's just a hard problem given Twitter's environment.
Its strength is freshness and volume, but I guess these can be achieved without Twitter if you have a strong web crawling infrastructure? Also, the current generation of LLM is not really capable of exploiting minute-level freshness... at least for now.
Also, Twitter is not where people go to be nice. Twitter incentivizes snarky, disparaging, curt behavior because (a) it limits message length to an extent where nice speech doesn't have a place (b) saying nice things gets you likes while saying not-nice things gets you retweets, and retweets are more highly valued by the algorithm.
Twitter’s data would be very valuable for generating tweet-like content: short, self-contained snippets, images and video.
There’s not a lot of data in Twitter today resembling long-form content: essays, news articles, books, scientific papers, etc. That’s probably why Twitter/X expanded the tweet size limit, to be able to collect such data.
If Twitter were as much of a treasure trove of user data as Elon thinks it is, then why is Twitter's ad targeting so much worse than Facebook's and Instagram's?
Twitter has the largest database of tweets. If you want an AI that writes tweets, there's nothing better. Why? Twitter could offer a service that bypasses the need for a community manager… just feed it a press release, or a product website, and it will produce a stream of well-crafted ad tweets… or astroturf, even.
Doesn't really matter what the content is as long as the sequences of tokens make sense. That's the goal: predict the next token given the previous N tokens. Higher level structures like tweets just fall out of that but I wouldn't be too surprised if a model trained only on tweets could also generalize to some other structures.
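As a toy illustration of that next-token objective (this is a hypothetical sketch, not anything any of these companies actually run), here is a minimal bigram "language model" that learns next-token counts from a few tweet-like strings and predicts the most likely continuation:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """For each token, count which tokens immediately follow it."""
    follows = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, token):
    """Return the most frequent token seen after `token`, or None."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

corpus = [
    "the model predicts the next token",
    "the model learns from raw text",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "model" follows "the" more often than "next"
```

A real LLM conditions on far longer contexts and uses learned embeddings rather than raw counts, but the objective is the same shape: given what came before, predict what comes next.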
It's not a change confined to Twitter; it's a glimpse into what could be the new normal for the internet.
You can bet Google/Gmail/YouTube, Amazon, Microsoft, TikTok, and every other Internet platform that works with user-generated content will soon do the same ... if they haven't done so already.
GMail's already been using your data for classification, data mining, and other AI purposes. I would be legitimately surprised if they're NOT using processed GMail data (potentially with sensitive data removed, etc.) in training their LLMs or other AI projects.
Yup. To be clear, that's with the free version though.
If you pay, corporate versions of Google Workspace won't train on your data. That's very much by design, since companies don't want anything internal ever being exposed.
But with the free version, that's part of what you're "paying" for it to remain free.
They'd certainly use it to train their spam classifiers, but using that data in a generative model shared with other users would risk information leaks, so they wouldn't do that.
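For a sense of what "training a spam classifier" on mail data could look like in miniature, here is a toy Naive Bayes sketch (purely illustrative; real mail classifiers are far more elaborate, and none of these names come from any actual system):

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, is_spam) pairs -> per-class word counts."""
    counts = {True: Counter(), False: Counter()}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
    return counts

def spam_score(counts, text):
    """Log-odds that `text` is spam, with add-one smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    n_spam = sum(counts[True].values()) + len(vocab)
    n_ham = sum(counts[False].values()) + len(vocab)
    score = 0.0
    for word in text.lower().split():
        p_spam = (counts[True][word] + 1) / n_spam
        p_ham = (counts[False][word] + 1) / n_ham
        score += math.log(p_spam / p_ham)
    return score  # > 0 leans spam, < 0 leans ham

data = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting notes attached", False),
    ("lunch tomorrow maybe", False),
]
model = train(data)
print(spam_score(model, "free money"))
print(spam_score(model, "meeting tomorrow"))
```

Note the contrast with a generative model: this classifier only emits a score, so it can't regurgitate anyone's mail verbatim, which is the leak risk the comment is pointing at.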
They've been doing it for years. Everything you put on their servers is a source for them to train on. At FB Messenger in the mid to late teens, the suggestion and auto-reply models were trained on the entire corpus of unencrypted messages sent between users.
how the fuck are those legal? frankly more disturbing to me than the AI training one, which I find less objectionable than selling people's attention to advertisers
The US doesn't have loser pays and has some of the most expensive litigation in the world, which has created all kinds of problems. Someone can file a lawsuit against you knowing that they're unlikely to win, but in so doing they could cost you hundreds of thousands of dollars for lawyers, so why don't you just go ahead and settle for tens of thousands of dollars? It will cost you less to settle than to win in court.
This flaw was made to scale by class action lawsuits, which more than any other should be loser pays, because there is little question that thousands of people who have each been harmed to the tune of $100 could each front $10 for a meritorious lawsuit. But instead you get opportunistic lawyers signing up anyone they can find for questionable claims, so they can reach a settlement where the plaintiffs each get $7 -- or a $7 gift certificate -- and the lawyers get millions.
This was rightly regarded as a problem but the lawyers had enough political power to prevent a good solution, so what we got instead was to make it easier to force binding arbitration and opt out of class action suits.
IANAL. According to [1] they are legal per Supreme Court precedent, but it also says:
"Contract formation is increasingly scrutinised. Following Concepcion and its progeny, some courts have focused on issues of contract formation to determine whether the consumer in fact agreed to arbitration and the class action waiver. This inquiry is largely confined to online transactions, where a consumer is deemed to have consented to arbitration by using the business's website to purchase goods or services. These contracts fall within the rubric of "clickwrap," "browsewrap," or "webwrap" agreements and their enforceability is beyond the scope of this article. However, it is important to note that the courts will refuse to enforce class action waivers and arbitration agreements in such agreements when the arbitration provisions were insufficiently conspicuous to ensure the consumer objectively agreed to their terms."
The article also mentions that non-negotiable consumer contracts are viewed with more suspicion by some courts.
Class action lawsuits suck for victims. The only people it’s good for are lawyers, who become fabulously wealthy, while the victims get a check for two dollars in the mail.
That backfired on them when all the Twitter employees they fired sued them in individual lawsuits, so now they have to do the same discovery process and everything else over and over for each case. Lol. That only backfired because enough individuals had the means to bring a full suit at the same time, though. Unlikely to happen normally.
Elon has a superpower: regardless of the question asked or answered, the audience splits 40-40 into "always believe" and "never believe", with the rest being skeptics or "don't cares".
This often works to his advantage; the "always believe" crowd is loud.
Oh, that guy with the self-driving cars and the self-landing rockets and the creepy humanoid robots that can sorta walk? Yeah totally, he hates automated stuff.
Ah yes, the platform formerly known as twitter shall surely NEVER allow the weights produced from analysis of its conversational data to go into the production of spambots!
After all, there's definitely no market for a product that has been finely-trained on social media posts to the point where it can perfectly ape social media posts. There's no money for Twitter to make there, so the technology it produces from this analysis shall certainly never find its way into spam bots that create content indistinguishable from genuine humans on Twitter.
It really is baffling. Removing bots from Twitter is totally orthogonal to training a bot on Twitter data- arguably the former is a prerequisite for the latter. And we've known that Musk wants to train a competing chatbot for at least 6 months now. Has he ever said anything to indicate that he wants to eliminate bots entirely?
Maybe I'm overthinking it and this is just a rhetorical gotcha.
> There's no other source like that in English other than Reddit.

(1) Facebook posts/comments, (2) Instagram posts/comments, (3) YouTube comments, (4) Gmail content, (5) LinkedIn comments, (6) TikTok content/comments
That's high quality content, timestamped and about current events.
There is very little content on Twitter that compares in quality to one well-written news article.
Lawyers ruin everything. They even ruin lawyers.
"Contract formation is increasingly scrutinised. Following Concepcion and its progeny, some courts have focused on issues of contract formation to determine whether the consumer in fact agreed to arbitration and the class action waiver. This inquiry is largely confined to online transactions, where a consumer is deemed to have consented to arbitration by using the business's website to purchase goods or services. These contracts fall within the rubric of "clickwrap," "browsewrap," or "webwrap" agreements and their enforceability is beyond the scope of this article. However, it is important to note that the courts will refuse to enforce class action waivers and arbitration agreements in such agreements when the arbitration provisions were insufficiently conspicuous to ensure the consumer objectively agreed to their terms."
The article also mentions that non-negotiable consumer contracts are viewed with more suspicion by some courts.
1. https://content.next.westlaw.com/practical-law/document/I9f1....
Individual lawsuits are better for everyone.
How do you know that the AI won't be used to sell people's attention to advertisers?
Facebook will be known as y.us.
"Replacing some shoddy heuristics with a massive AI model" seems like product improvement to me.
Therefore, they were already allowed to train AI models with your data.
The CEO can use it on Elon to keep him placated.
How many times has Trump been caught lying or doing a 180 from one day to the next? And there are still tons of people who believe him.
I'm honestly surprised that anyone on HN wouldn't understand this.