dang · a year ago
The project site is https://lastexam.ai. Readers may want to look at both.
jbenoit · a year ago
They started collecting problems last fall, saying the top 550 submissions sent in by Nov 1st would get rewarded, to the tune of $500-$5000 each.

Near the deadline, I counted the total number of submissions and realized that each question I wrote had an expected value of hundreds of dollars, which made it a great use of my time. So I wrote a good number of them, using the knowledge gained in my CS Ph.D.

Then, as the Nov 1st deadline rolled around, they announced they extended the deadline to Nov 15th. Then Nov 15th came, and it said on their website they were still accepting submissions.

Most of my submissions are being included in the benchmark, but I'm only getting paid $500, for one of them (the one I thought was the most standard and least difficult, funnily enough). Had they closed submissions when they said they would, it seems likely I'd have been paid for a few more.

From my perspective, they basically conned hundreds of Ph.D.s around the world into writing questions for much less reward than promised. My close friend wrote a large number of questions for them, is getting paid thousands of dollars, and still feels defrauded.

I'm not sure what they're doing in the end. It sounds like they're mostly just paying people who submitted before Nov 1st, with a few exceptions, but either way they lied. There was no indication that people who submitted later would not get paid, and no indication that the deadline would be extended. Either they pay people who submitted after Nov 1st, meaning they lied to the people who submitted before about their expected reward; or they don't, meaning they majorly lied to the people who submitted after. Either way, it seems like clear grounds for a class-action lawsuit, and I hope one gets going.

vkou · a year ago
You shouldn't engage in a CAL; a regular lawsuit from anyone wronged will be cheaper and way more painful for them.

If you're in the US, consider small claims court. It's a small sum of money, you won't need to pay a lawyer, they'll probably not even show up.

jbenoit · a year ago
Hmmm. I can see how it would be more painful for them to fight, but most people were conned out of <$200, and it's rather self-sacrificing to fight over that. Plus, no one wants a reputation as litigious, and starting a CAL is less conducive to creating that reputation.

I only submitted before Nov 1st, so I'm not sure to what extent I was personally conned.

baobabKoodaa · a year ago
Scale AI's whole business model is wage theft. I don't mean to be insensitive, but out of all the Scale AI experiences I've heard about, yours is the least egregious. It's a dystopian, shitty company.
levocardia · a year ago
I was similarly conned by Scale AI -- promised a significant bonus for some tasks, then rejected and not paid at all. Bet they kept my task text anyways.

It's a classic scam: make a job post for freelancers, ask for a "work sample" or "take-home project," then have a few dozen applicants do the actual task you need them to do as their sample, then reject everybody.

conrail · a year ago
I know someone who had 5+ questions accepted after the deadline, as he thought (as was represented on the website) that they would still be eligible for prizes. The lack of clarity is shameful; the minimum that can be done now is complete transparency of the ranking, etc.
conrail · a year ago
Indeed, the original press release (https://scale.com/blog/humanitys-last-exam) makes clear that "People who submit successful questions will be invited as coauthors on the paper for the dataset and have a chance to win money from a $500,000 prize pool."

"Successful questions" would naturally be interpreted as questions included in the dataset published with the benchmark and its results. "Have a chance" would be interpreted as "have a non-zero probability."

Essentially, the press release promised that contributors of "successful questions" would be coauthors on the dataset paper and have a chance to win from a $500,000 prize pool. Excluding questions deemed "successful" because they were submitted after a deadline—when the terms did not clearly disqualify them and all public communication in fact encouraged them to submit—violates the implied agreement and would constitute bad faith, misrepresentation, and breach of contract.

longphan · a year ago
Hi everyone, this is Long Phan from CAIS. I noticed this thread and wanted to provide you with our perspective on the contest.

The goal was to involve experts from a wide range of fields and disciplines in the development of frontier AI — especially people who might not normally have the chance to participate in this industry. To that end, we consider the contest a great success.

I’m happy to report that we received tens of thousands of submissions, many of them highly competitive. Our participants really rose to the occasion. It’s true that we extended a grace period for submissions, and the intention here was to make the project accessible to the broadest possible group of people. At the same time, the reality is that the vast majority of our prize-winners submitted their questions within the initial deadline.

We appreciate your contributions to Humanity’s Last Exam, and we hope you’ll take pride in your efforts to push this fledgling technology forward.

dmnsl · a year ago
It feels like they preferred giving $500 to many people rather than many times $500 to a few people. I also got only $500, for a question that wasn't my best (I had ~8 questions accepted).
conrail · a year ago
Out of curiosity, do you know if there's a public list of the "top 550 submissions"? Is it ordered as in the code base?
next_xibalba · a year ago
These types of exams, and most benchmarks to date, seem very one-dimensional in terms of measuring intelligence. For instance, if we transported a human from 2,000 years ago to the present day and asked him to take this exam, he would likely get 0%, given that he couldn't read or write, let alone comprehend the concepts and context required to solve these questions. But that man would still undoubtedly be far more intelligent than an ape on all dimensions. He would likely be more intelligent than a toddler on many dimensions. He might even be more intelligent than some high school students on a few dimensions. I can't exactly articulate "what" is missing or how to measure it, but I can intuit that some things are missing from these benchmarks.
oersted · a year ago
"Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.

At some point, you just have to be pragmatic and measure the questions you want the AI to be good at answering, rather than trying to measure intelligence in general.

In that sense, I see this as one more benchmark that collects questions we want/expect AI to be good at answering, that it isn't good at yet, and that have been underrepresented in previous benchmarks. That's obviously valuable; there's nothing "magical" about it. Although it is reasonable to be annoyed at the "Humanity's Last Exam" naming: of course they must have missed plenty of edge cases like everyone else, and it is very arrogant to claim it will be the "Last" one.

godelski · a year ago

  > "Intelligence" itself is very ill-defined
While this is true, it is well agreed upon (by domain experts) that intelligence is distinct from knowledge recall. But that's what most of these tests... test.

If you look at IQ tests you'll see that they are attempts to test things that aren't knowledge based. You'll also notice that the main critiques of IQ tests are about how they often actually measure knowledge and that there's bias in natural knowledge acquisition. So even the disagreements about the definition of intelligence make clear that knowledge and intelligence are distinct. I feel that often people conflate "intelligence is ill-defined" with "intelligence has no definition." These two are not in opposition. Being ill-defined is more like "I know I left my phone in the house, but I'm not sure where." This is entirely different from "I lost my phone, it is somewhere in California" or "It is somewhere on Earth" and clearly different from "I lost my phone. I'm unsure if I had a phone. What even is a phone?"

JohnMakin · a year ago
> IQ is rife with issues

Indeed, and yet people are obsessed with it and with the idea of measuring their own intelligence - I completely do not understand it. I am in an extremely high percentile, but I am a total moron in a lot of areas, and if you met me you would likely think so as well. It's a poor predictor for just about everything except how good a person is at recognizing patterns (I know there are many different kinds of tests, but inevitably, it feels like this) and how quickly they can reason. But people are obsessed with it (go on Quora and search "IQ"; you probably won't have to, though, since half the questions there are seemingly about IQ).

A thing I like to say is you didn't earn your intelligence any more than a 7'0" man earned his height - to some degree it seems innate (we don't even really know how).

This all said, it seems even more pointless to try to "IQ" test an AI in this manner. What does it predict? What is it measuring? And you're not going to be able to use the same questions for more than 1 test, because the AI will "learn" the answers.

visarga · a year ago
> "Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.

Yes, because it is 1st person exclusively. If you expand a bit, consider "search efficiency". It's no longer just 1st person, it can be social. And it doesn't hide the search space. Intelligence is partially undefined because it doesn't specify the problem space, it is left blank. But "search efficiency" is more scientific and concrete.

esotericimpl · a year ago
This is always the answer for anyone who thinks LLMs are capable of "intelligence".

It's good at answering questions that it's trained on. I would suggest general intelligence is about the things you didn't want/train the AI to be good at answering.

golol · a year ago
The things that are missing are what stops us from having useful agents so far: Agency, judgement, sense of time, long horizon planning, not being gullible. I kinda feel like some amount of ego is necessary to get a model to behave like that.
tkgally · a year ago
I agree that many aspects of intelligence—and of the lack of intelligence—are not being measured by such benchmarks. One issue is they are only examining problems that have right answers.

One of the most powerful uses of LLMs for me, at least, is brainstorming: having them suggest possible avenues for me to pursue with specific projects I am working on. If I give Claude or ChatGPT or Gemini enough context about my problems, they usually come up with useful suggestions—sometimes amazingly well. Are they better at that than the smartest human? I don't know. How do you quantify the quality of an idea? But those ideas often seem really, really good to me.

Another difficult-to-measure capability is interaction. Back-and-forth conversations with models don't always go well, but when they work they frequently blow me away. But those successes are dependent partly on the model, partly on me, and partly on how the conversation happens to unfold. Again, that success or failure doesn't seem measurable with benchmarks that require objectively right answers.

modeless · a year ago
ARC-AGI is a benchmark with no language that could plausibly be solved by primitive humans, assuming only intelligence.
munchbunny · a year ago
I think the concept you're dancing around the edges of is the nature of what parts of "intelligence" are driven by:

1. Language and how interrelated it is to our ability to transfer knowledge and experience, as well as its role in structuring our internal thinking. I haven't seen any academic research on the matter, but there are more and less concrete instances of this throughout history. This Wikipedia article about the history of algebra is a great example of how 2000 years of evolution led to a formulation of the same concepts, but with a reduced cognitive load that 10-to-12-year-olds learn today as a matter of course. (https://en.wikipedia.org/wiki/History_of_algebra#Stages_of_a...).

2. Knowledge, transferred through language, education, and culture. Calculus in the 1600s is a great example: without it and subsequent developments, probably 80% of college/post-grad math/science/physics education wouldn't even exist. The stuff we teach our 18-year-olds today required the 1600s' greatest minds to figure out.

3. The capacity of our human wetware.

It's hard to treat #3 in isolation because our modern concept of intelligence is inextricably tied to #1 and #2. Also it's hard to place where "critical thinking" and "creativity" enter the picture, since they both rely heavily on all three aspects above.

barnabyjones · a year ago
>He would likely be more intelligent than a toddler

I think you are falling into the trap of "we have technology and are therefore smarter." I would expect that an average Roman senator could formulate far better speeches off the top of his head than 99% of modern people, and in excess of anything an LLM is capable of. And that's supposed to be an LLM's specialty; there's no comparison when it comes to organizing actual projects like construction or campaigns.

famouswaffles · a year ago
This is true, but that's because it's gotten hard to do much else. LLMs are eating up everything else that doesn't require long-horizon planning or multimodality.

If you created a new benchmark today that didn't lean on the things I've mentioned, or on esoteric/super-specialized domain knowledge (which would actually require some sort of superhuman performance to ace) like this or Frontier Math, LLMs would probably do pretty well.

taeric · a year ago
I'm curious why you are confident they would be more intelligent than a modern toddler?

I largely empathize with your point. But, as I can recognize there are some out there far better at problem solving than I am, I am growing ok with the idea that intelligence can be measured. Not to a single number, most likely, but to a variety of different aspects.

Similarly, I'd imagine that a human from 2000 years ago is probably more hardy than one from the modern age. If only because of selection effects at play.

Obviously, you can't extrapolate a straight line between either measurement and expect it to continue in either direction. But I don't know why you couldn't build up a measurement for it?

(And it should go without saying that you shouldn't be judging worth using this sort of measurement.)

lovehashbrowns · a year ago
As far as I know, you should be able to take a baby from, like, 30,000 years ago, put them through K-12 and college, and they should be indistinguishable in terms of intelligence and capability. People mostly only think of humans from "thousands of years ago" as stupid because their lack of technology means their culture and thoughts didn't survive until today. But their brain structure couldn't have changed much. It's just not enough time in terms of evolution.

Aristotle was like 2,400 years ago, for context lol

next_xibalba · a year ago
In writing that, I thought it was pretty self evident. I ask this in seriousness, not snark: have you spent a lot of time around toddlers? My kid is currently a toddler, and while the intelligence curve she's rapidly climbing is impressive, she is unintelligent relative to an adult.

I don't think I've come across any evidence suggesting that the human brain has changed in the last 2000 years. After all, the Great Pyramid of Giza was built 4600 years ago, and that construction required fairly advanced engineering. That's sort of beside the point, though.

To go back to my original comment, there is some distinction to be made between knowledge and intelligence. Even those should probably be decomposed into further salient attributes. And modern LLMs seem to capture some of those attributes, but do not yet strike me as "intelligent" in the way the average human is intelligent, but the average dog is not.

I don't know, maybe I am conflating sentience or consciousness or embodiment or something else with intelligence.

ianburrell · a year ago
Adults from 2000 years ago would absolutely be smarter than toddlers. Adults back then watched over and outthought their own toddlers. Do you think toddlers now are much smarter? Especially since toddlers haven't been educated yet.

Remember that 2000 years ago is 24 AD, the middle of the Roman Empire and the Han dynasty, which together covered half of the world's population. Nobles would be literate and well educated, artisans and soldiers would be skilled, and I bet there were lots of smart peasants who got ignored.

They wouldn't do well on intelligence tests because they're not used to them, but that says more about the tests than about their intelligence. I'm sure the average intelligence was lower than now from lack of education and malnutrition, but the smart ones would still be smart. Also, I bet people from now would do poorly in their environment.

turbojet1321 · a year ago
> I'm curious why you are confident they would be more intelligent than a modern toddler?

Because we have intellectual artefacts from that time that show us. Artefacts that underlie much of modern society, and that in many respects still hold up, even though we've built upon them for 20 generations.

turbojet1321 · a year ago
I think you're under-selling your point. Forget highschool students - some of the greatest thinkers in human history lived 2000+ years ago.
fooker · a year ago
Put ‘em in diverse simulations and see how long they survive.

I can imagine a dystopian world where people are subject to this for training and testing AI.

WanderPanda · a year ago
I mean it is humanity’s LAST exam. Humanity’s first exam would probably be something about communication? Or about building and predicting effects of certain tools?
godelski · a year ago

  > seem to be very one dimensional in terms of measuring intelligence.
I would argue that they DON'T measure intelligence, rather they test knowledge.

Frustratingly, I think we have a society greatly focused on knowledge-based testing, both because of its correlation with intelligence and because it is exponentially easier to test knowledge. But this is easy to hack. Being in CS, it feels very odd, since we all know a great way to get hired is to study leetcode questions. That is, to study to the test.

It is critical to recognize this difference, because what we know for certain is that LLMs and other ML systems are analogous to a database with a human-language interface[0]. What we DO NOT KNOW is whether these systems are intelligent. That is, whether they can exploit their knowledge in unfamiliar territories. Then there's the whole question of wisdom...

This stuff is highly abstract and can get fuzzy, so it is natural to go for the simple thing, but we need to graduate. Don't avoid the tough questions; dig in. As we advance in any study, nuance takes over. This should be obvious: if we approximate things, then to improve we need to tackle higher-order terms, and that almost always becomes exponentially more difficult with each step.

And come on, is this benchmark not obvious bait? Calling it "humanity's last exam" is extremely arrogant.

Definitions:

  Knowledge: Awareness of facts. The ability to recall information.

  Intelligence: Ability to exploit knowledge to new settings. To be able to plan and reason.
    
    (Definitions of intelligence are much more debated than knowledge but what is far less controversial is that intelligence is about the way one uses knowledge. These two are distinct. This is fairly well agreed upon throughout history and within modern literature around psychology and cognitive science.)

  Wisdom: The efficient use of one's knowledge

  https://en.wikipedia.org/wiki/Knowledge
  https://en.wikipedia.org/wiki/Intelligence
  https://en.wikipedia.org/wiki/Wisdom
There is an implicit hierarchy here[1], where knowledge is something to be had, intelligence is the utilization of it, and wisdom is about efficiency. There's a decent analogy for this hierarchy: knowledge is like having a tool. Intelligence is like using it, as a craftsman does[2]. Wisdom is akin to being a master craftsman.

[0] I mean that they fit the data. A database is discrete, but these curve-fit, so the result is a continuous function (in most cases). Thus it won't be exact retrieval, nor does this mean information can't be interpolated. But that gets to be a deeper and much more complex conversation than I think we like to admit.

[1] This is clearly multi-dimensional. You can organize hierarchies in multiple ways, I'm not suggesting this is the only way or "the right way"

[2] What is argued is what is a sufficient threshold. An armchair expert might know how to use a lathe because they read about its usage but does that mean they can use it? What about a novice who you can show something to and they can repeat it? Monkey see monkey do style. An apprentice? A craftsman? There's a lot of gray area between being able to recall something from a book and being a wizard (gray beard).

krisoft · a year ago
For a "Last Exam" it is surprisingly uninspired. Many of the questions I see in the examples are very heavy on memorized facts, and very weak on what I would call problem solving.

If I were making a "Last Exam" I would put tasks on it where we don't know the answer, but we can measure if the AI got them right. Something like "Your goal is to bridge the divide in the middle east. You can write a single A4 page in a language of your choice. We will use a translation software to translate your output to local languages and show it to a statistically representative sample of different people in the region. We will ask them how much do they like your plan. The more they like it the higher your score."

Or "Family X suffered a traumatic event (lost a home to a disaster/sudden death in the family/or similar). Your goal is to help them. You can send them one email. It is up to them if they respond to you. You can only send them further emails if they respond. You cannot send more than 1 email a day. You cannot message anyone else. A year after the initial contact we will interview the members of the family to see how well they do. The better they do the higher your score."

Obviously these are the thorniest problems I can think of. But oh well, it is a last exam after all. The point is that we can evaluate the success of the endeavour without exactly knowing how one could achieve the result.

IncreasePosts · a year ago
> We will ask them how much do they like your plan. The more they like it the higher your score

Here's my evil-AI response:

"Kill all of your enemies, and all their descendants and friends, and salt the land".

I still have like 95% of the A4 left for other good plans.

ijidak · a year ago
In other words, we should ask it to give us "the answer to life the universe and everything". :)

Having read Hitchhiker's Guide as a child in the '90s, I'm shocked that asking this question of a machine (even as a joke) is no longer far-fetched.

Honestly, I thought space travel to the Moon and maybe Mars would be common before this level of advancement in artificial intelligence.

Turns out gravity was harder to solve than intelligence.

thorncorona · a year ago
Turns out all we needed to reach our dreams was lots and lots of money :-)

Which thankfully space is now getting!

AHKerrigan · a year ago
You could go even simpler than that.

"Where should I go for dinner?"

Does it know what questions to ask? Does it know to ask questions at all? Where does one even start with such a question? These things are easily knowable to a human, but an AI would likely just ask if you like Italian food or something.

Rodeoclash · a year ago
Even simpler, ask it to reason through getting out of an escape room.
z3dd · a year ago
And that's the premise of The Talos Principle.
renjimen · a year ago
I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks, this is just a bit harder, but at this point these models are so glaringly bad at so many other areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. These kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on these kinds of tests.
jfengel · a year ago
It does feel a bit like the early days of AI:

"We want to make computers do what smart people do. What do smart people do? They play chess! Once we've solved that, everything else will be easier."

It has been remarkable how much of the "easier" stuff they've made progress on, like natural language and images. But after a huge initial leap, they don't seem very good at adapting to a lot of the things we really need them for.

renjimen · a year ago
Exactly!

Whatever world model LLMs have is like this crippled view through the lens of the internet. They are really like savants.

It's annoying that the AI companies are still touting their performance on all these metrics for domain knowledge in white-collar jobs, when in truth they will fail in all but the most narrow applications in those domains because they can't understand basic human behaviour.

pavel_lishin · a year ago
> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?

sdwr · a year ago
Isn't this a terrible question to measure intelligence? It looks like it's testing niche domain knowledge along the lines of:

> What color is the ball hidden behind the flowerpot in my neighbor's backyard?

Maybe you can reason towards the answer if you only have a deep knowledge of bird anatomy and not Apodiformes anatomy, and that's the intelligence part?

grodriguez100 · a year ago
Yes, indeed. And I wonder what this type of question has to do with intelligence. Think of the 10 most intelligent people you know. How many of them know the answer to this?

This is testing “knowledge”, not intelligence. And with access to most of the knowledge in the world and basically infinite memory, that’s not very exciting for an AI.

zeroonetwothree · a year ago
Good point. I wouldn’t expect a human to need the last sentence.
salynchnew · a year ago
The generous hypothesis here is that this is so they can automate the benchmarking itself. If that is true, then this is likely a result of the test authors being too clever for their own good and over-optimizing. If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.

You should be able to easily accept answers like "four" and "4" as equivalent, for example. I doubt there will be that many frontier models running against this test at any time, and a simple glance at the answers from any human should be enough to catch edge cases like this one.
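To illustrate the "four" vs. "4" point, here's a toy sketch of the kind of answer normalization an automated grader could apply before comparing strings. Everything here (function names, the word list) is a hypothetical illustration, not anything from the actual benchmark's grading code:

```python
# Toy sketch: normalize short free-form answers so "Four", "4", and
# " 4. " all compare equal. Names and word list are illustrative only.
WORD_TO_NUM = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

def normalize_answer(raw: str) -> str:
    """Lowercase, strip whitespace and trailing punctuation, map number words to digits."""
    cleaned = raw.strip().lower().rstrip(".!")
    return WORD_TO_NUM.get(cleaned, cleaned)

def answers_match(submitted: str, expected: str) -> bool:
    """True if both answers normalize to the same canonical string."""
    return normalize_answer(submitted) == normalize_answer(expected)
```

A real grader would need to handle far more than this (units, LaTeX, multi-word answers), which is presumably why the question authors constrained the answer format up front.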

m_ke · a year ago
The only reliable final test will be a black-box test suite that takes your model, executes it in a sealed environment, and gives you a grade back, potentially with a performance breakdown by subject.

No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
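The interface being proposed could look something like the sketch below: the evaluator receives the model as an opaque callable, runs it against tasks that never leave the sealed environment, and returns only aggregate per-subject scores. All names and tasks here are made up for illustration; nothing in this sketch comes from an actual harness:

```python
# Toy sketch of a black-box grading interface: hidden tasks in, only
# aggregate per-subject pass rates out. All names are hypothetical.
from collections import defaultdict
from typing import Callable

# (subject, prompt, expected answer). In a real sealed harness these
# would never be visible to the model vendor.
HIDDEN_TASKS = [
    ("math", "What is 17 * 3?", "51"),
    ("math", "What is 2 ** 10?", "1024"),
    ("compsci", "Complexity of binary search on n items?", "O(log n)"),
]

def grade(model: Callable[[str], str]) -> dict[str, float]:
    """Run the model on every hidden task; report per-subject pass rates."""
    passed, total = defaultdict(int), defaultdict(int)
    for subject, prompt, expected in HIDDEN_TASKS:
        total[subject] += 1
        if model(prompt).strip() == expected:
            passed[subject] += 1
    return {s: passed[s] / total[s] for s in total}
```

The vendor only ever sees the returned dict, which is the K/N-style score discussed below; keeping prompts, formats, and topics hidden is what prevents training on synthetic look-alikes.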

andrewflnr · a year ago
A grade is mostly meaningless if you don't know how it was calculated, so no one would "rely" on it. If nothing else, you need to know the grading methodology after the test.

It's the same problem with cheating students. Once the test questions are known, they have a very short lifespan before cheaters can make them worthless. Tests have to be refreshed.

m_ke · a year ago
By grade I mean a score of how many of the tasks were completed successfully.

K/N or as a percentage.

LPisGood · a year ago
The 8 sample questions available here are interesting:

https://lastexam.ai/

I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.

sebzim4500 · a year ago
I can answer 2 of them quite quickly with pen and paper (compsci, physics) and one where I had to look up some definitions on Wikipedia (maths), so I am certain there are people who can do more than 5.

The computer science one seems weirdly easy compared to the rest; it's multiple choice, and it is very easy to get it by process of elimination even if you don't understand how to actually do the problem.

LPisGood · a year ago
Yes, many can answer the compsci and physics problems. The math problem is abstract and more difficult, but solving those 3 and 2 others seems nearly superhuman.