They then hypothesized a general factor, “g,” to explain this pattern. Early tests (e.g., Binet–Simon; later Stanford–Binet and Wechsler) sampled a wide range of tasks, and researchers used correlations and factor analysis to extract the common component, then normed it to a mean of 100 with an SD of 15 and called it IQ.
IQ tends to meaningfully predict performance across some domains, especially education and work, and shows high test–retest stability from late adolescence through adulthood. It also tends to be consistent across high-quality tests, despite a wide variety of testing methods.
It looks like this site just uses human-rated public IQ tests. It would have been more interesting if an IQ test had been developed specifically for AI, i.e. a test that aims to factor out the strength of a model's general cognitive ability across a wide variety of tasks. That is probably doable by running principal component analysis on a large set of the benchmarks available today.
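A rough sketch of what that could look like, assuming you already have a models × benchmarks score matrix. Everything here is hypothetical: the data is synthetic, `g_scores` is a made-up helper name, and plain PCA on standardized scores is only a stand-in for a proper factor analysis. The last step rescales the first component to mean 100 and SD 15, the usual IQ convention.

```python
import numpy as np


def g_scores(scores: np.ndarray) -> tuple[np.ndarray, np.ndarray, float]:
    """Hypothetical helper: extract a 'g'-like score per model.

    scores has shape (n_models, n_benchmarks); higher means better.
    Returns (iq_style_scores, benchmark_loadings, explained_variance_ratio).
    """
    # Standardize each benchmark so no single benchmark dominates by scale.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

    # PCA via SVD: rows of vt are the principal axes in benchmark space.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]

    # The sign of a principal axis is arbitrary; orient it so that
    # doing better on the benchmarks gives a higher score.
    if loadings.sum() < 0:
        loadings = -loadings

    # Project each model onto the first principal component.
    pc1 = z @ loadings

    # Share of total variance captured by the first component.
    explained = float(s[0] ** 2 / np.sum(s ** 2))

    # Rescale to the IQ convention: mean 100, SD 15 within this sample.
    iq = 100 + 15 * (pc1 - pc1.mean()) / pc1.std()
    return iq, loadings, explained


# Synthetic data (an assumption, not real benchmark results):
# 30 models, 8 benchmarks, each driven by one latent ability plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=30)
weights = rng.uniform(0.5, 1.0, size=8)
scores = latent[:, None] * weights + 0.5 * rng.normal(size=(30, 8))

iq, loadings, explained = g_scores(scores)
print("share of variance on first component:", round(explained, 2))
print("first five IQ-style scores:", np.round(iq[:5], 1))
```

Note that the resulting scale is only relative to the models in the sample, the same way human IQ is relative to the norming population. Whether real benchmarks load on a single dominant component is an empirical question this toy example can't answer.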
It’s like using ChatGPT in high school: it can be a phenomenal tutor, or it can do everything for you and leave you worse off.
The general lesson from this is that results ARE NOT everything.
Be aware that uv will create a full copy of that environment for each script by default. Depending on how many scripts you have, this can become wasteful really fast. There is a flag, "--link-mode symlink", which links the dependencies from the cache instead. I'm not sure why this isn't the default or what disadvantages it has, but so far it's working fine for me and has saved me several gigabytes of storage.