The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others - the “positive manifold.”
They then hypothesized a general factor, “g,” to explain this pattern. Early tests (e.g., Binet–Simon; later Stanford–Binet and Wechsler) sampled a wide range of tasks, and researchers used correlations and factor analysis to extract the common component, then normed it to a mean of 100 with an SD of 15 and called it IQ.
IQ tends to meaningfully predict performance across some domains, especially education and work, and shows high test–retest stability from late adolescence through adulthood. It also tends to be consistent across high-quality tests, despite a wide variety of testing methods.
It looks like this site just uses human-rated public IQ tests. But it would have been more interesting if an IQ test were developed specifically for AI, i.e. a test that would aim to factor out the strength of a model's general cognitive ability across a wide variety of tasks. It is probably doable by running principal component analysis on a large set of the benchmarks available today.
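A rough sketch of what that could look like, with an invented benchmark matrix standing in for real scores (the shapes and numbers below are purely filler; in practice the columns would be real per-model benchmark results):

    import numpy as np

    # Hypothetical benchmark matrix: rows = models, columns = benchmark scores.
    # The values are random filler just to show the mechanics.
    rng = np.random.default_rng(1)
    bench = rng.normal(size=(25, 8))

    # Standardize each benchmark, take the first principal component as a crude
    # "general ability" factor, then rescale it to an IQ-like metric (mean 100, SD 15).
    z = (bench - bench.mean(axis=0)) / bench.std(axis=0)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    g = z @ vt[0]
    iq_like = 100 + 15 * (g - g.mean()) / g.std()
    print(np.round(iq_like, 1))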
> The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others
My son took an IQ test and it wouldn't score him because he breaks this assumption. He was getting 98% in some tasks and 2% in others. The psychologist giving him the test said it was an unlikely enough pattern that they couldn't get an IQ result for him. He's been diagnosed with non-verbal learning disability (NVLD), and this is apparently common for NVLD folks.
IMO g is purely an abstraction. As long as the rate at which you learn most things is within a reasonable bound, spending more or less time learning/perfecting X impacts the time you can spend on Y, resulting in people being generally more or less proficient across a huge range of common cognitive skills. Thus, testing those general skills is normally a proxy for a wide range of things.
LD breaks IQ because it results in noticeably uneven skill acquisition, even in foundational skills. Meanwhile, increasing levels of specialization reward being abnormally good at a very narrow set of skills, making IQ less significant. The #1 rock climber in the world gets sponsors; the 100th gets a hobby.
A term in use for your son's profile is "twice exceptional." The GP is correct about the theoretical basis of the tests. Note the use of "tend" in the quote. Even those who fit that pattern better tend to have differential strengths, so the single-factor picture has been shown to be too simple. Over time the models of intelligence have grown more complex, adding EQ (emotional quotient), SQ (social quotient), and so on, but IQ was first and continues to be considered useful in some ways, even as it has also been considered an instrument of oppression by some.
IQ is a discovery about how intelligence occurs in humans. As you mentioned, a single factor explains most of a human's performance on an IQ test, and that model is better than theories of multiple orthogonal intelligences. By contrast, five orthogonal factors are the best model we have for human personality.
The first question to ask is "do LLMs also have a general factor?". How much of an LLM's performance on an IQ test can be explained by a single positive correlation between all questions? I would expect LLMs to perform much better on memory tasks than anything else, and I wouldn't be surprised if that was holding up their scores. Is there a multi-factor model that better explains LLM performance on these tests?
Yes, there is some research about it here - https://www.sciencedirect.com/science/article/pii/S016028962...
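A crude way to probe the general-factor question, assuming you could assemble a models-by-questions correctness matrix from a site like this (the matrix below is random filler):

    import numpy as np

    # Rows = models, columns = test questions (1 = answered correctly).
    rng = np.random.default_rng(0)
    scores = (rng.random((30, 20)) > 0.5).astype(float)

    # Correlation matrix between questions, then the share of variance captured
    # by the largest eigenvalue - a rough "is there one general factor?" check.
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    print("share explained by the first factor:", eigvals[0] / eigvals.sum())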
> The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others - the “positive manifold.”
I'm pretty sure that this is not true, and that the tests were developed to measure children's intellectual development, and whether they were behind or ahead for their age. A bunch of people saw them and decided that they were far better than the primitive tests they had devised in an attempt to limit immigration from southern Europe, or to justify legal discrimination against black people, and wished a universal intelligence scalar into existence.
They justify this by saying that the results on this year's test correlate with the results of last year's test. They are not laughed at. The thing it most correlates with is the value of your parents' car or cars.
That's my recollection, too--IQ was developed to judge kids' "mental age" (how far ahead or behind they are compared to their peers in their current grade) and was only later retrofitted onto the g factor model of intelligence.
One potential issue with that approach is the factors wouldn't stay very constant across generations of AI models.
While a lot of people have used various methods to try to gauge the strength of various AI models, one of my favorites is this time horizon analysis [1], which took coding tasks of various lengths, looked at how long it takes humans to complete those tasks, and compared that to the chance that the AI would successfully complete the task. Then they looked at various thresholds to see how long a task an AI could generally complete at a given success rate. They found that the length of task an AI is able to complete at a given threshold is doubling about every 7 months.
The reason I found this to be an interesting approach is both because AI seems to struggle with coding tasks as the problem grows in complexity, and because being able to give it more complex tasks is an important metric, whether for coding or, more generally, for asking AIs to act as independent agents. In my experience, increasing the complexity of a problem has a much larger performance falloff for AI than for humans, where the task would just take longer, so this approach makes a lot of intuitive sense to me.
[1] - https://theaidigest.org/time-horizons
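A minimal sketch of that kind of fit, with invented data (real task lengths and outcomes would come from the linked analysis):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical results: human completion time per task (minutes) and whether
    # the model solved it.
    minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
    solved = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

    # Fit P(success) against log task length, then solve for the length at which
    # success probability is 50% - the model's "time horizon".
    X = np.log(minutes).reshape(-1, 1)
    clf = LogisticRegression().fit(X, solved)
    w, b = clf.coef_[0, 0], clf.intercept_[0]
    print(f"50% time horizon ~ {np.exp(-b / w):.0f} minutes")  # logit(0.5) = 0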
Isn't that basically what the ARC tests are?
Reductively, yes.
IMO, the ARC tests & the visual pattern IQ tests (e.g. Raven's) have little difference, especially if the Raven tests require the taker to draw out the answer.
https://en.wikipedia.org/wiki/Raven%27s_Progressive_Matrices
Another component of this theory concerning g is that it's largely genetic, and immune to "intervention" AKA stability as you mentioned. See the classic "The Bell Curve" for a full exposition.
Which makes me wonder what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if the g factor is nature and by and large immutable? What's the logic of the educators here?
If IQ potential were 50% genetic, then teaching could potentially raise your actual IQ by affecting the other 50%, which is huge. IQ scores in populations and individuals change based on education, nutrition, etc. But even if we hypothesized a pretend world where “g” was magically 100% genetic, this (imaginary) measure is just potential. It is not true that an uneducated, untrained person will be able to perform tasks at the level of an educated, trained person. Also, The Bell Curve was written by a political operative to promote ideological views, and is full of foundational errors.
Lower and median IQ people still benefit from literacy, numeracy, and art to function in society. The point of education systems isn't to boost individuals' dimensionally reduced 1D metrics but rather to enrich their lives and their contributions to society. There will always be distributions of abilities and means, but that doesn't justify neglecting the bulk of tax-paying people.
G is (largely) immutable, but knowledge and skills are not. The economy is not zero-sum and we all benefit from increasing the total amount of human capital. Unfortunately, thinking around education is dominated by people who wrongly believe that the economy is zero-sum.
Assuming it is true (which I doubt), there is obviously still value in teaching: the goal is to give students more knowledge and practical skills, not to produce more intelligent students.
You can have an IQ of over 200, but if no one ever showed you how a computer works or gave you a manual, you still won't be productive with it.
But I very much believe intelligence is improvable and also degradable; just ask some alcoholics, for instance.
"The Bell Curve" is, let's say, highly controversial and not a good introduction into the topic. Its claim that genetics are the main predictor of IQ, which was very weakly supported at the time, has been completely and undeniably refuted by science in the thirty years since it's publication.
There is no logic. Some people are just born much smarter than others and there is nothing you can do about it if you aren't one of them. The implications of that fact are too much for most people to accept so we pretend that it isn't true.
Intelligence is not knowledge and it is not wisdom. You have to “learn” to get those.
It’s much more akin to VO2 max in aerobic exercise, something like 70% genetic. It is still good for everyone to exercise even if it is harder or easier for some.
One’s ceiling may be more or less stable, but there are many instances where individuals have certain underdeveloped cognitive skills (of which there are a litany), undergo training to develop those skills, and then afterwards go on to score (sometimes much) higher on IQ tests. Children with certain disabilities such as autism or FASD tend to see more dramatic differences. This isn’t to say they “became more intelligent,” but rather that the testing is unable to measure intelligence directly, relying instead on those certain cognitive skills as a proxy for intelligence.
> Which makes me wonder what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if the g factor is nature and by and large immutable? What's the logic of the educators here?
Many/most people (esp. young people) are not pushed to the limits of their capacity to learn.
Quality interventions guide people closer to these limits.
Crystallised vs. fluid intelligence: IQ tests mainly test the latter, whilst education focuses on the former, as well as on learning strategies and general problem-solving strategies.
Also, even if intelligence were 100% genetic, we could still in theory increase everybody's IQ equally with education, and the previous statement would still hold.
> concerning g is that it's largely genetic, and immune to "intervention" AKA stability
Biology seems to be the destination for smuggled-in quasi-religious beliefs: Lysenkoism, creationism and, in the case of a segment (or more?) of the western professional-managerial class, this kind of Bell Curveism.
In the past people were satisfied with the divine apotheosis of people into a superior Brahmin class, or a chosen people, or even in more modern times a Calvinistic elect; this is an attempt to smuggle the same thing in on a secular, "scientific" basis.
If we were watching elementary particles smash together in an accelerator, the idea that a brain could be boiled down to a number and ranked in order, and that this could be attributed to differential genes and such, would be seen as absurd. Especially considering that human behavioral modernity happened two to three thousand generations ago, if one thinks on a genetic time scale. For our version of biological Lysenkoism or creationism, this all goes out the window though. Speaking of Lysenkoism, it is akin to the Marxist idea of false consciousness - the people who believe such ideas can see the errors of Lysenkoism or creationism, but the crank idea tied to their particular system makes a lot of sense to them.
I think of al-Andalus in Spain in the 1450s, or the Battle of Vienna in 1683. Until a few centuries ago, Europe could barely keep itself free of Arab or Turk rule (and often didn't). Go back a few centuries and this would be about the genetic superiority of the Arab brain over the Caucasian. It's all quite silly.
Education can be better adapted to the child's needs.
Human IQ has been going up at a very fast rate for the last 100 years. Kids tested by the Army when being enlisted for WW1, who had a mean score of 100 (by definition) in 1917, would now have a mean score of less than 80. There is tons of interesting work on why IQ is shooting up, but it definitely is NOT genetics.
This is known as the Flynn effect. Here is the wikipedia entry in case you want more details:
https://en.wikipedia.org/wiki/Flynn_effect
Video games like Civilization and Myst etc. probably add 5 to 10 IQ points to kids. My 9 year old knew all about Greek triremes. Me: What the heck is a trireme?
Most of what people call “genetic effects” are actually “gene-by-environment” interaction effects. Change the environment and you change the APPARENT heritability. A good example is the genetics of substance use disorders, with heritability estimates ranging from 30 to 60% - but IF AND ONLY IF there are drugs around to which you are exposed in YOUR environment.
Same applies to good and bad schools.
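A toy simulation of the exposure point above, with made-up numbers, just to show how apparent heritability shifts when the environment changes:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    liability = rng.normal(size=n)   # genetic liability
    noise = rng.normal(size=n)       # everything else

    def apparent_h2(exposure_rate):
        # The liability only expresses itself in people who are exposed at all.
        exposed = rng.random(n) < exposure_rate
        phenotype = np.where(exposed, liability, 0.0) + noise
        r = np.corrcoef(liability, phenotype)[0, 1]
        return r ** 2

    print("everyone exposed:", round(apparent_h2(1.0), 2))   # ~0.5
    print("20% exposed:     ", round(apparent_h2(0.2), 2))   # much lower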
The Bell Curve is a classic of racist science, and it accelerated the destruction of America's greatness, which was always that we are a land of immigrants.
There is probably a correlation between how fast a human can do math problems and how intelligent they are in general.
But a very trivial Python program running on a normal computer will beat the fastest human at math problems in terms of speed, even though it does nothing else useful.
This is more akin to you being unable to tell apart the syllables or tones in a very unfamiliar language.
I believe the ARC-AGI benchmark fits that description; it's sort of an IQ test for LLMs, though I would caution against using the word "Intelligence" for LLMs.
If a model can get an IQ of 120, but can't draw clocks at a precise requested time, or properly count the b's in blueberry, can we then agree IQ tests don't measure intelligence?
I imagine the value of something like this is for business owners to choose which LLMs they can replace their employees with, so its use of human IQ tests is relevant.
The point is that the correlation between doing well on these tasks and doing well on other (directly useful) tasks is well established for humans, but not well established for LLMs.
If the employees' job is taking IQ tests, then this is a great measure for employers. Otherwise, it doesn't measure anything useful.
This website's method doesn't work at all for humans the way it works for LLMs. For humans, there is a strict time limit on these IQ tests (at least in officially recognised settings like Mensa). This kind of sequence completion is mostly a question of how fast your brain can iterate on problems. Being able to solve more questions within the time limit means you get a higher score because your brain essentially switches faster. But for LLMs, they just give them all the time in the world in parallel and see how many questions they can solve at all. If you look at the examples, you'll see some high-end models struggling with some of the first questions, which most humans would normally get easily. Only the later ones get hard, where you really have to think through multiple options. So a 100 IQ LLM here is not actually more intelligent at IQ test questions than 50% of humans.
If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall-clock time interval thanks to the underlying hardware, not because they are fundamentally smarter.
So, in a way you have defined a good indicator for a limit for a certain area.
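For what it's worth, the human scoring convention being referenced here is just a percentile-to-normal-curve conversion; a minimal sketch (the percentile is made up, and the site's own scoring method isn't documented here):

    from scipy.stats import norm

    # Raw score -> percentile against the timed norming sample -> z-score -> IQ.
    percentile = 0.84   # e.g. solved more items within the time limit than 84% of the norm group
    iq = 100 + 15 * norm.ppf(percentile)
    print(round(iq))    # ~115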
Mensa really needs to be left out of these discussions. It's not scientific; it is just a money grab for people who need intellectual validation. You can be admitted with a top 10% SAT score and no in-person testing at all. The in-person testing is in three parts: one part is a memory test, the second part is a Mensa test, the third part is the Wechsler test. Source: I joined in 1995 because I needed intellectual validation. :)
There is not enough sampling here to reach this conclusion. Remember, you can crank things like o3 pretty high on tasks like ARC AGI if you're willing to spend thousands of dollars on inference time compute. But that's obviously not in the budget for an enthusiast site like this.
The point of this is not so much to compare humans with AI, but to compare AI with traditional software development approaches to solving this domain (IQ tests, in this case). I believe, and I could be wrong, that it will be nearly impossible, or too expensive, to develop deterministic software to beat AI at IQ tests.
I agree that it's wrong to do so, but the maintainer of this site certainly thinks that the point is to compare humans with AI. He frequently compares the results to human IQ test takers without any sort of caveats: "Now o3 scores an IQ of 116, putting it in the top 15% of humans. The median Maximum Truth reader, for comparison, scored 104." [0]
0: https://www.maximumtruth.org/p/skyrocketing-ai-intelligence-...
That's not even the point. IQ tests are normalized for individuals in their own age group; if they're comparing them to people, then which age group are they comparing with? Also, the tests are timed, so IQ is more a measure of how quickly something can be figured out, which really doesn't apply to computers. The whole idea that you can apply an IQ score to an LLM is ridiculous.
Judging from the reasoning trace for the problem of the day, almost all of the models obviously had some IQ-style material in their training data, or at least it could be said that the models are very biased in a beneficial way. From the beginning of the trace you can see that the model had already "figured it out" - the reasoning is done only to apply the basic arithmetic.
None of the models did actually "reason" about what the problem could possibly be - none of them considered that more intricate patterns are possible in a 3x3 grid (having taken these kinds of tests earlier in life, I still had a few seconds of indecision, wondering whether this was the same kind of test I'd seen before or some more elaborate one), and none of them tried solving the problem column-wise (it is still possible, by the way). Personally, I think that indicates a strong bias present in the pretraining. For what it's worth, I would consider a model that came up with at least a few different interpretations of the pattern while "reasoning" to be the most intelligent one, irrespective of the correctness of the answer.
They run each model through the political leaning quiz.
Spoiler alert: They all fall into the Left/Liberal box. Even Grok. Which I guess I already knew but still find interesting.
The political leaning quiz is extremely biased, though. It's basically all questions like "should stuff benefit humans or corporations," which doesn't represent what either side actually thinks.
The questions look like they're taken directly from politicalcompass.org. It's not exactly scientific, but there are 20 years of results floating around to compare with, since it was a popular site on the early internet.
That's essentially what it boils down to, though. If a policy would benefit corporations, the GOP supports it. If a policy would benefit citizens, the GOP is against it.
https://www.quantamagazine.org/the-ai-was-fed-sloppy-code-it...
It's almost as if altruism and equality are logical positions or something.
That's a fascinating paper, but you're editorializing it a bit. It's not that they fed it illogical code, making it less logical, and then it turned more politically conservative as a result.
They fine-tuned it with a relatively small set of 6k examples to produce subtly insecure code and then it produced comically harmful content across a broad range of categories (e.g. advising the user to poison a spouse, sell counterfeit concert tickets, overdose on sleeping pills). The model was also able to introspect that it was doing this. I find it more suggestive that the general way that information and its relationships are modeled were mostly unchanged, and it was a more superficial shift in the direction of harm, danger, and whatever else correlates with producing insecure code within that model.
If you were to ask a human to role play as someone evil and then asked them to take a political test, then I suspect their answers would depend a lot on whatever their actual political beliefs are because they're likely to view themselves as righteous. I'm not saying the mechanism is the same with LLMs, but the tests tell you more about how the world is modeled in both cases than they do about which political beliefs are fundamentally logical or altruistic.
Human IQ is norm-referenced psychometrics under embodied noise. Calling both “IQ” isn't harmless; it invites bad policy and decisions built on a false equivalence. Don't promote it.
> Note: VERBAL models are asked using the verbalized test prompt. VISION models are asked the test image instead without any text prompts.
Just glancing at the bar graphs, the vision models mostly suck across the board for each question, whereas the verbal ones do OK.
And today's example of clock faces (#17) does a good job of demonstrating why: when a lot of the diagrams are explained verbally, they become significantly easier to solve.
Maybe it's just me, but take #17: it's not immediately obvious those are even supposed to represent clocks, and yet the verbal prompt turns each one into a clock time for the model (e.g. 1:30), which feels like 50% of the problem being solved before the model does anything at all.
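To make that concrete, even just turning hand angles into a time like "1:30" is a step the verbal prompt hands to the model for free; a tiny sketch with hypothetical angles:

    def angles_to_time(hour_angle_deg: float, minute_angle_deg: float) -> str:
        # Angles measured clockwise from 12 o'clock.
        minute = round(minute_angle_deg / 6) % 60    # 360 degrees / 60 minutes
        hour = int(hour_angle_deg // 30) % 12 or 12  # 360 degrees / 12 hours
        return f"{hour}:{minute:02d}"

    print(angles_to_time(45.0, 180.0))  # hour hand halfway past 1, minute hand at 6 -> 1:30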
Putting aside all of the discussion on whether IQ is a valid construct, IQ tests are designed for humans and make a lot of assumptions in this direction. Having a computer score well on them is better than the computer scoring poorly, but probably does not mean anything close to what the same result means in a human.
>Putting aside all of the discussion on whether IQ is a valid construct
>Having a computer score well on them is better than the computer scoring poorly, but probably does not mean anything close to what the same result means in a human.
The first caveat is important because if you don't "put it aside" they do in fact mean pretty much the same, i.e. nothing useful or relevant. You can use IQ to measure subnormal intelligence. Average or above scores mean nothing beyond that you can get those scores on an IQ test.
Worth repeating every time it comes up.
Right now LLMs are like students that study for years, then get their brains frozen into a textbook before they’re released. They can read new stuff during use (context window), but they don’t actually update their core weights on the fly. The “infinite context window” dream would mean every interaction is remembered and folded back into the brain, seamlessly blending inference (using the model) with training (reshaping it).
Within 2–3 years, we’ll see practical “personal LLMs” with effectively infinite memory via retrieval + lightweight updates, feeling continuous but not actually rewriting the core brain.
Within 5–10 years, we’ll likely get true continual-learning systems that can safely update weights live, with mechanisms to prune bad habits and compress knowledge—closer to how a human learns daily.
The rub is less can we and more should we: infinite memory + unfiltered feedback loops risks building a paranoid mirror that learns every user’s quirks, errors, and biases as gospel. In other words, your personal live-updating LLM might become your eccentric twin.
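A minimal sketch of the "retrieval + lightweight updates" half of that prediction; a real system would use a learned embedding model, so the hashed bag-of-words vector here is only a stand-in:

    import numpy as np
    from collections import Counter

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised.
        vec = np.zeros(dim)
        for word, count in Counter(text.lower().split()).items():
            vec[hash(word) % dim] += count
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    class RetrievalMemory:
        def __init__(self):
            self.texts, self.vectors = [], []

        def add(self, text: str) -> None:
            # "Lightweight update": store the memory instead of touching model weights.
            self.texts.append(text)
            self.vectors.append(embed(text))

        def recall(self, query: str, k: int = 3) -> list[str]:
            # Pull the closest memories back into the prompt/context.
            if not self.texts:
                return []
            sims = np.stack(self.vectors) @ embed(query)
            return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]

    memory = RetrievalMemory()
    memory.add("User prefers concise answers.")
    memory.add("User is learning Rust.")
    print(memory.recall("what language is the user studying?", k=1))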