Possible, but their 'advantage' is unlikely to last more than a few years.
The scariest thing is that there are people who advocate for it. Because humans are dangerous, I guess, so it's better to preemptively enslave them.
Sample
> Social control in the sense of not wanting lots of unemployed and restless youths. Having a system where long term and steady work is required in order to "live a good life" implies control - you have to act right and follow the rules in order to keep a job, which is itself necessary in order to have enough food and other necessities.
> Those making this argument here I believe are also making an argument for an alternative where the productivity of society is more equally spread, without the need to make everyone work for it.
> I agree that it's a system of social control, but I don't think it's nefarious or bad. We really don't want to live in a society where 25-year-old men don't have meaningful work to do and roam the streets getting into trouble.
The argument also doesn't _really_ make sense - there's already a socially accepted system of 'social control' which _directly_ keeps people following the rules: the law.
It's also unclear why lack of work would cause young people to "roam the streets" instead of staying home and roaming the internets, as they're already increasingly doing in their free time.
Uh-oh. When I hear that, I assume it will:
1. Be more tightly controlled
2. Be more strongly censored
3. Probably not actually be safer for people like me.
> This article is from The Technocrat
Is it now?... that does not bode well.
> If you use Google, Instagram, Wikipedia, or YouTube
Oh, you mean _that_ Internet: 4 sites which get a huge part of the traffic. Well, the first, second, and fourth of these are quite unsafe: they surveil your activities for commercial manipulation purposes and also let the US (and maybe other) governments get some of that information.
As for Wikipedia, its editorial/censorship/moderation policies are variegated and complex, and while I'm not well-read about that, it does seem that they have at least some sort of a mainstream-politics bias.
> The DSA will require these companies to assess risks on their platforms, like the likelihood of illegal content
Lots of things can be illegal, especially in states with more restrictive laws. That doesn't sound very safe.
> The DSA will require these companies to assess risks on their platforms, like ... election manipulation,
Ah, now we're getting somewhere. So this is formalizing the drumming-up-hysteria-about-Russia shenanigans we've seen in recent years. Once there were witches and gremlins and leprechauns who caused mischief; now it's those evil Russian hackers, sent by evil Putin, since why not, right? Just recently we read in the Twitter Files how Twitter employees were pressured by the US government to come up with supposed Russian meddling, and were panicking since there wasn't any, so they had to cook something up.
> Perhaps most important, the DSA requires that companies significantly increase transparency
That's good, but about what?
> ... through reporting obligations for “terms of service”
Uh, that's not so interesting. Plus, they still get to have outrageous "terms of service". Those things shouldn't be enforceable anyway; it's not like you can seriously negotiate those terms.
> hate speech, misinformation, and violence.
And who decides which information is valid and which isn't? Also, what if governments engage in misinformation or violence, as they often do? I'm pretty sure it's going to be the "information we don't like", which is sometimes misinformation, and sometimes not.
> You will be able to participate in content moderation decisions that companies make and formally contest them
Such platforms should probably just be recognized as semi-public so that commercial companies can't censor them without a court order.
> ... you're going to start noticing changes to content moderation, transparency, and safety features on those sites over the next six months.
https://www.youtube.com/watch?v=-gGLvg0n-uY
Hehe
> Who are you to decide what's misinformation anyway?
> That sounds like something a misinformation terrorist would say.
...
> First, we'll censor any use related to social taboos. Then we'll censor anything we desire. If anyone complains, we'll accuse them of wanting to engage in and promote social taboos.
> I watched a panel on AI (machine learning) at a conference hosted by the European Commission.
> 9 people on the panel
> Everyone agreed that the USA was 100 miles ahead of the EU in machine learning and China was 99 miles ahead
> In any case, everyone agreed that in the most important technology of the 21st century, the EU was not on the map.
> The last person on the panel was an entrepreneur.
> He noted that the EU had as many AI startups as Israel (a country 1/50th the size) and, btw, two-thirds of those were in London, which was heading out the door due to Brexit.
> So basically the EU had 1/3 the AI startups of Israel (this was a few years ago)
> So the panel discussion turned to "What should the EU do?"
> And the more or less unanimous conclusion (except for the entrepreneur) was "We are going to build on the success of GDPR and aim to be the REGULATORY LEADER of machine learning"
> I literally laughed out loud
> Being the "Regulatory Leader" is NOT A REAL THING.
> Imagine it is the early 20th century and imagine that cars were invented and that the USA and China were producing a lot of cars.
> The EU of today would say "Building cars looks hard, but we will be the leader in STOP SIGNs"
> This is defeatism, this is surrender, this is deciding to be a vassal state of the United States and China in the 21st century.
> The EU is already a Web 2 vassal to the US tech companies (none of its own, so it has to try to limit their power)
Modern AI is pretty harmless though, so it doesn't matter yet.
Yes, and that's why the only thing people achieve by flipping out about the "safety" of making models public is making the public distrustful about AI safety.
If your model is really that good, unleash it into the open so that others can truly evaluate it, warts and all, and help improve it by identifying the flaws.
And when they do release one, people scream at them (see Galactica).
"Journalists" react like this:
> On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.
> Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.
> However, Meta and other companies working on large language models, including Google, have failed to take it seriously.
Yann LeCun confirmed this: https://twitter.com/pmarca/status/1631185701864865792
I wonder if they just leaked it onto 4chan themselves, lol.
LLaMA is still only available to the elite.
files_catbox_moe[slash]o8a7xw(dot)torrent
Quote below:
Humans, one might say, are the cyanobacteria of AI: we constantly emit large amounts of structured data, which implicitly rely on logic, causality, object permanence, history—all of that good stuff. All of that is implicit and encoded into our writings and videos and ‘data exhaust’. A model learning to predict must learn to understand all of that to get the best performance; as it predicts the easy things which are mere statistical pattern-matching, what’s left are the hard things. AI critics often say that the long tail of scenarios for tasks like self-driving cars or natural language can only be solved by true generalization & reasoning; it follows then that if models solve the long tail, they must learn to generalize & reason.
Early on in training, a model learns the crudest levels: that some letters like ‘e’ are more frequent than others like ‘z’, that every 5 characters or so there is a space, and so on. It goes from predicting uniformly-distributed bytes to what looks like Base-60 encoding—alphanumeric gibberish.
As crude as this may be, it’s enough to make quite a bit of absolute progress: a random predictor needs 8 bits to ‘predict’ a byte/character, but just by at least matching letter and space frequencies, it can almost halve its error to around 5 bits. Because it is learning so much from every character, and because the learned frequencies are simple, it can happen so fast that if one is not logging samples frequently, one might not even observe the improvement.
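The arithmetic in that step is easy to check yourself. Here is a minimal sketch, assuming nothing beyond the Python standard library: a uniform byte predictor pays a flat 8 bits/char, while a model that has learned only the character frequencies of a toy English-like sample (the sample text is made up for illustration) pays roughly half that:

```python
import math
from collections import Counter

def unigram_bits_per_char(text: str) -> float:
    """Empirical character entropy: the average bits/char paid by a
    predictor that has learned only the letter/space frequencies of `text`."""
    n = len(text)
    counts = Counter(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy English-like sample (illustrative only, not real training data).
sample = (
    "the quick brown fox jumps over the lazy dog and "
    "the rain in spain stays mainly in the plain "
) * 25

uniform = 8.0  # a random predictor over 256 possible byte values
unigram = unigram_bits_per_char(sample)
print(f"uniform: {uniform:.1f} bits/char, unigram: {unigram:.2f} bits/char")
```

On ordinary English text the unigram figure comes out around 4–5 bits/char, matching the "almost halve its error" claim above.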
As training progresses, the task becomes more difficult. Now it begins to learn what words actually exist and do not exist. It doesn’t know anything about meaning, but at least now when it’s asked to predict the second half of a word, it can actually do that to some degree, saving it a few more bits. This takes a while because any specific instance will show up only occasionally: a word may not appear in a dozen samples, and there are many thousands of words to learn. With some more work, it has learned that punctuation, pluralization, possessives are all things that exist. Put that together, and it may have progressed again, all the way down to 3–4 bits error per character! (While the progress is gratifyingly fast, it’s still all gibberish, make no mistake: a sample may be spelled correctly, but it doesn’t make even a bit of sense.)
But once a model has learned a good English vocabulary and correct formatting/spelling, what’s next? There’s not much juice left in predicting within-words. The next thing is picking up associations among words. What words tend to come first? What words ‘cluster’ and are often used nearby each other? Nautical terms tend to get used a lot with each other in sea stories, and likewise Bible passages, or American history Wikipedia articles, and so on. If the word “Jefferson” is the last word, then “Washington” may not be far away, and it should hedge its bets on predicting that ‘W’ is the next character, and then if it shows up, go all-in on “ashington”. Such bag-of-words approaches still predict badly, but now we’re down to perhaps <3 bits per character.
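The jump from frequencies to associations can be illustrated the same way. A hypothetical sketch (the corpus is made up): a bigram model that conditions on the previous character is what lets a predictor hedge on ‘W’ and then "go all-in on ashington", and its conditional entropy is measurably below the unigram baseline:

```python
import math
from collections import Counter

# Tiny made-up corpus where "washington" reliably follows "jefferson was president after".
corpus = ("jefferson was president after washington . "
          "washington was a general . ") * 50

def unigram_bits(text: str) -> float:
    """Bits/char when predicting each character from frequencies alone."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def bigram_bits(text: str) -> float:
    """Bits/char when predicting each character from the one before it
    (the conditional entropy H(next | previous))."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    n = len(text) - 1
    return -sum(c / n * math.log2(c / prev_counts[a])
                for (a, b), c in pair_counts.items())

print(f"unigram: {unigram_bits(corpus):.2f} bits/char, "
      f"bigram: {bigram_bits(corpus):.2f} bits/char")
```

In this toy corpus the character after ‘w’ is almost deterministic, so conditioning on context buys a large saving; real text is messier, but the direction of the effect is the same.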
What next? Does it stop there? Not if there is enough data and the earlier stuff like learning English vocab doesn’t hem the model in by using up its learning ability. Gradually, other words like “President” or “general” or “after” begin to show the model subtle correlations: “Jefferson was President after…” With many such passages, the word “after” begins to serve a use in predicting the next word, and then the use can be broadened. By this point, the loss is perhaps 2 bits: every additional 0.1 bit decrease comes at a steeper cost and takes more time. However, now the sentences have started to make sense. A sentence like “Jefferson was President after Washington” does in fact mean something (and if occasionally we sample “Washington was President after Jefferson”, well, what do you expect from such an un-converged model).
Jarring errors will immediately jostle us out of any illusion about the model’s understanding, and so training continues. (Around here, Markov chain & n-gram models start to fall behind; they can memorize increasingly large chunks of the training corpus, but they can’t solve increasingly critical syntactic tasks like balancing parentheses or quotes, much less start to ascend from syntax to semantics.)
Now training is hard. Even subtler aspects of language must be modeled, such as keeping pronouns consistent. This is hard in part because the model’s errors are becoming rare, and because the relevant pieces of text are increasingly distant and ‘long-range’. As it makes progress, the absolute size of errors shrinks dramatically.
Consider the case of associating names with gender pronouns: the difference between “Janelle ate some ice cream, because he likes sweet things like ice cream” and “Janelle ate some ice cream, because she likes sweet things like ice cream” is one no human could fail to notice, and yet, it is a difference of a single letter. If we compared two models, one of which didn’t understand gender pronouns at all and guessed ‘he’/‘she’ purely at random, and one which understood them perfectly and always guessed ‘she’, the second model would attain a lower average error of barely <0.02 bits per character!
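The "<0.02 bits per character" figure is just amortization, which a couple of lines verify (the sentence is from the text above; the accounting is an illustrative sketch):

```python
import math

sentence = ("Janelle ate some ice cream, because she likes "
            "sweet things like ice cream")

# At the single pronoun position, the random guesser pays log2(2) = 1 bit
# (a 50/50 he/she choice) and the perfect model pays ~0 bits. Spread over
# the whole sentence, that one-bit gap becomes tiny per character.
gap_per_char = math.log2(2) / len(sentence)
print(f"average gap: {gap_per_char:.4f} bits/char")
```

One fully resolved binary choice, divided by ~70 characters of sentence, lands comfortably under the 0.02 bits/char bound quoted above.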
Nevertheless, as training continues, these problems and more, like imitating genres, get solved, and eventually at a loss of 1–2 (where a small char-RNN might converge on a small corpus like Shakespeare or some Project Gutenberg ebooks), we will finally get samples that sound human—at least, for a few sentences.
These final samples may convince us briefly, but, aside from issues like repetition loops, even with good samples, the errors accumulate: a sample will state that someone is “alive” and then 10 sentences later, use the word “dead”, or it will digress into an irrelevant argument instead of the expected next argument, or someone will do something physically improbable, or it may just continue for a while without seeming to get anywhere.
All of these errors are far less than 0.02 bits per character; we are now talking not hundredths of bits per character but less than ten-thousandths. The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character. What is in that missing >0.4?
Well—everything! Everything that the model misses. While just babbling random words was good enough at the beginning, at the end, it needs to be able to reason its way through the most difficult textual scenarios requiring causality or commonsense reasoning. Every error where the model predicts that ice cream put in a freezer will “melt” rather than “freeze”, every case where the model can’t keep straight whether a person is alive or dead, every time that the model chooses a word that doesn’t help build somehow towards the ultimate conclusion of an ‘essay’, every time that it lacks the theory of mind to compress novel scenes describing the Machiavellian scheming of a dozen individuals at dinner jockeying for power as they talk, every use of logic or abstraction or instructions or Q&A where the model is befuddled and needs more bits to cover up for its mistake where a human would think, understand, and predict.
For a language model, the truth is that which keeps on predicting well—because truth is one and error many. Each of these cognitive breakthroughs allows ever so slightly better prediction of a few relevant texts; nothing less than true understanding will suffice for ideal prediction.
If we trained a model which reached that loss of <0.7, which could predict text indistinguishable from a human, whether in a dialogue or quizzed about ice cream or being tested on SAT analogies or tutored in mathematics, if for every string the model did just as good a job of predicting the next character as you could do, how could we say that it doesn’t truly understand everything? (If nothing else, we could, by definition, replace humans in any kind of text-writing job!)
Why think pretraining or sequence modeling is not another one of them? Sure, if the model got a low enough loss, it’d have to be intelligent, but how could you prove that would happen in practice? (Training char-RNNs was fun, but they hadn’t exactly revolutionized deep learning.) It might require more text than exists, countless petabytes of data for all of those subtle factors like logical reasoning to represent enough training signal, amidst all the noise and distractors, to train a model. Or maybe your models are too small to do more than absorb the simple surface-level signals, and you would have to scale them 100 orders of magnitude for it to work, because the scaling curves didn’t cooperate. Or maybe your models are fundamentally broken, and stuff like abstraction require an entirely different architecture to work at all, and whatever you do, your current models will saturate at poor performance. Or it’ll train, but it’ll spend all its time trying to improve the surface-level modeling, absorbing more and more literal data and facts without ever ascending to the higher planes of cognition as planned. Or…
But apparently, it would’ve worked fine. Even RNNs probably would’ve worked—Transformers are nice, but they seem mostly to be about efficiency. (Training large RNNs is much more expensive, and doing BPTT over multiple nodes is much harder engineering-wise.) It just required more compute & data than anyone was willing to risk on it until a few true-believers were able to get their hands on a few million dollars of compute.
GPT-2-1.5b had a cross-entropy WebText validation loss of ~3.3. GPT-3 halved that loss to ~1.73. For a hypothetical GPT-4, if the scaling curve continues for another 3 or so orders of compute (100–1000×) before crossing over and hitting harder diminishing returns, the cross-entropy loss will drop to ~1.24.
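A minimal sketch of the kind of extrapolation being described, assuming a single power law L(C) = a·C^(−b) and a made-up relative compute ratio between the two models; only the two loss figures (~3.3 and ~1.73) come from the comment, and the real WebText scaling fit is more involved than this:

```python
import math

# Two (relative compute, loss) anchor points. The 1000x compute ratio is an
# assumption for illustration, not the actual GPT-2 -> GPT-3 training budget.
c2, l2 = 1.0, 3.30     # GPT-2-1.5b WebText validation loss
c3, l3 = 1000.0, 1.73  # GPT-3

# Fit L(C) = a * C**(-b) exactly through both points.
b = math.log(l2 / l3) / math.log(c3 / c2)
a = l2 * c2 ** b

def predicted_loss(compute: float) -> float:
    return a * compute ** (-b)

# Extrapolate another 2-3 orders of magnitude, as the comment does.
for mult in (100, 1000):
    print(f"{mult}x GPT-3 compute -> predicted loss ~{predicted_loss(c3 * mult):.2f}")
```

Under these made-up compute ratios the toy fit lands near, but not exactly on, the ~1.24 figure quoted above; the point is the shape of the curve, not the specific numbers.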
If GPT-3 gained so much meta-learning and world knowledge by dropping its absolute loss ~50% when starting from GPT-2’s level, what capabilities would another ~30% improvement over GPT-3 gain? (Cutting the loss that much would still not reach human-level, as far as I can tell.) What would a drop to ≤1, perhaps using wider context windows or recurrency, gain?
What the ... seriously?
How is sentence completion able to generate thoughtful answers to questions? If it goes word by word, or sentence by sentence, how does it produce the structure you ask it for (e.g. an essay)? There must be something more than just completion. What do the 175 billion parameters encode?
It seems to me that, as Stephen Wolfram says, this says something about our language in the first place, rather than about what ChatGPT does.
Edit: Also, openly calling OpenAI employees "gullible" and "twitter morons" seems sub-optimal if you'd like that talent to work for you at some point.
Example - https://x.com/tszzl/status/2029334980481212820