marcellus23 · 4 months ago
Which ChatGPT model was this? Was web search enabled? Reasoning? Those will make a big difference in results, I imagine, and not mentioning them makes the article almost meaningless.

> How are you supposed to do data analysis with “intelligent” software that can’t nail something so basic?

It is really interesting how there seems to be a camp of people that tries one of these tools for something it's not very good at, then declares the tool is useless. No, it's not AGI, and yes, it has a lot of weak spots. But like any tool, you do have to know how to use it. And it's clearly creating a lot of value for the people who do know where its strengths and weaknesses are.

pelorat · 4 months ago
None of these posts ever seem to mention which exact model was used. Was reasoning turned on? Was internet search enabled?
some_random · 4 months ago
It's worth noting that "these posts" these days also always seem to demand graphs generated as images rather than tables of data.
cherryfan44 · 4 months ago
Gary's in his daily low-effort blog post. Is there any worse narcissistic attention seeker out there?
ir77 · 4 months ago
last week i broke down and bought the $20 subscription because i was excel-lazy and wanted to see what supposedly state-of-the-art AI could do to help me parse my kid's baseball team's statistics... and boy am i more confident than ever that humanity is not doomed and there won't be AI taking over jobs anytime soon.

the amount of time i spend typing out "you've changed something i didn't ask for" is incredible. the only positive is that it's fun to verbally abuse the AI for how inaccurate and deliberate its errors feel.

i would not take a single answer from chatgpt without doing a sanity check. if it throws up on an excel sheet with a 12x30 matrix of one-dimensional data, i can't imagine the garbage it spews out if you're using it to modify something actually business-essential.

jstanley · 4 months ago
You don't want to ask AI to directly manipulate 360 distinct data points for the same reason you wouldn't ask a human to do that.

Get it to write a script to process the data instead.
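
For example, here's a minimal sketch of the kind of script you might ask it to produce (the CSV filename and column names are hypothetical, just to illustrate the shape of the request):

~~~
# Hypothetical per-game batting stats in a CSV with columns
# "player", "at_bats", "hits" (names assumed for illustration only).
import pandas as pd

stats = pd.read_csv("team_stats.csv")

# Total each player's at-bats and hits, then compute a batting average.
summary = stats.groupby("player")[["at_bats", "hits"]].sum()
summary["avg"] = (summary["hits"] / summary["at_bats"]).round(3)

print(summary.sort_values("avg", ascending=False))
~~~

A script like this is something you can read, re-run, and sanity-check, unlike an opaque in-place edit of the spreadsheet.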

aothms · 4 months ago
In my experience more often than not it'll do that for you and run the code to give you the answer.

E.g. just now I ran:

~~~

What's the inverse of this matrix?

[[0.0759688880587972, 0.3299493758880664, 0.14599823333009276, 0.09897577338235164], [0.7478951343477207, 0.9006422280032432, 0.3137352150172843, 0.5974301357799204], [0.5568700534103693, 0.648666832906366, 0.5280912469659284, 0.5753861797637815], [0.7905984807343551, 0.053757547259005944, 0.9632727273522135, 0.44727851917969785]]

~~~

Reasoning output:

~~~

The user is asking for the inverse of a matrix. Since they want the result, I will compute it and present it as a matrix in Markdown. I’ll use Python for the calculation but won’t include any code in the final answer. It’s best to show the resulting matrix in Markdown so the user sees only the output, as requested. I’ll use Python to compute it and then provide the result without extra code.

import sympy as sp
import numpy as np

# define the matrix
M = np.array([
    [0.0759688880587972, 0.3299493758880664, 0.14599823333009276, 0.09897577338235164],
    [0.7478951343477207, 0.9006422280032432, 0.3137352150172843, 0.5974301357799204],
    [0.5568700534103693, 0.648666832906366, 0.5280912469659284, 0.5753861797637815],
    [0.7905984807343551, 0.053757547259005944, 0.9632727273522135, 0.44727851917969785]
], dtype=float)

# calculate inverse
invM = np.linalg.inv(M)

invM

~~~

And then I get a nicely formatted LaTeX matrix as the final outcome.

(edit: that's o4-mini-high)
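
If you'd rather not take the model's word for it, a quick local sanity check (just a sketch, reusing the matrix from the prompt above) is to multiply it by the returned inverse and compare against the identity:

~~~
import numpy as np

M = np.array([
    [0.0759688880587972, 0.3299493758880664, 0.14599823333009276, 0.09897577338235164],
    [0.7478951343477207, 0.9006422280032432, 0.3137352150172843, 0.5974301357799204],
    [0.5568700534103693, 0.648666832906366, 0.5280912469659284, 0.5753861797637815],
    [0.7905984807343551, 0.053757547259005944, 0.9632727273522135, 0.44727851917969785],
])

invM = np.linalg.inv(M)

# M @ invM should be (numerically) the 4x4 identity matrix,
# so this should print True if the returned inverse is correct.
print(np.allclose(M @ invM, np.eye(4)))
~~~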

sunrunner · 4 months ago
Do comparisons to "PhD-level" as a marker of how 'smart' a system is (by which I mean 'AI' in its current colloquial usage) really make sense?

I thought a PhD was ultimately a representation of a body of work that a likely-smart person produced in order to a) learn the state of the art in a narrow domain, b) practice research methodology, and c) hopefully push the boundary of human knowledge in that area. Which is not to say that PhD candidates and holders are not 'smart', just that the PhD represents the work, not just the person, unlike an IQ test.

Or is the comparison valid and the goal is for AI (again, current colloquial usage) to be able to push the boundaries of human knowledge? Or perhaps it would be human-machine-knowledge at that point?

Daviey · 4 months ago
The comparisons between AI systems and "PhD-level" intelligence don't really make sense as a meaningful benchmark, and I agree with your assessment of what a PhD actually represents.

A PhD isn't simply a marker of raw intelligence or knowledge retention; it represents years of specialised training, critical thinking development, original research, and the creation of new knowledge. The process involves learning to identify meaningful questions, design appropriate methodologies, interpret results with nuance, and contribute novel insights to a field.

I've been thinking about this from another angle though: what if we considered "PhD-level" more narrowly in terms of context window capacity? Since an average PhD dissertation is around 70K words (translates to roughly 90-100K tokens in most LLM tokenization schemes), perhaps one benchmark could be whether an AI system can maintain the equivalent context. By this definition, several current models would technically qualify:

- Claude 3.5 Sonnet: 200K tokens
- GPT-4o: 128K tokens
- Claude 3 Opus: 200K tokens
- Anthropic's experimental systems: ~1M tokens
- Google Gemini 1.5 Pro: 1M tokens
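
As a rough sketch of where the words-to-tokens figure comes from (assuming OpenAI's tiktoken library and its cl100k_base encoding; other tokenizers will differ somewhat):

~~~
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Made-up prose sample, purely to estimate a tokens-per-word ratio.
sample = ("The proposed method improves retrieval accuracy across all "
          "benchmarks considered in this chapter. ") * 2000

words = len(sample.split())
tokens = len(enc.encode(sample))

# Typical English prose tends to land in the 1.2-1.4 tokens-per-word range,
# which is roughly where the ~90-100K token estimate for a 70K-word
# dissertation comes from.
print(f"{words} words -> {tokens} tokens ({tokens / words:.2f} tokens/word)")
~~~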

But this framing has significant limitations. Context window is just raw capacity, like saying a hard drive with enough storage for a dissertation is "PhD-level." The ability to simply retain 70,000 words of context doesn't mean a system can identify significant gaps in existing knowledge, formulate original research questions, or synthesize new insights that advance human understanding.

Current AI systems, regardless of context window size, don't truly "understand" information the way humans do. They recognise patterns in data and generate outputs based on statistical relationships, but lack the deeper conceptual understanding, intentionality, and the other characteristics of human intelligence.

A more meaningful comparison might focus on specific capabilities rather than vague intelligence comparisons or simple context metrics. The goal shouldn't be simply to declare AI systems as "PhD-level smart" but to develop tools that complement human intelligence and extend our collective problem-solving capabilities.

sunrunner · 4 months ago
> Since an average PhD dissertation is around 70K words (translates to roughly 90-100K tokens in most LLM tokenization schemes), perhaps one benchmark could be whether an AI system can maintain the equivalent context.

This is a really interesting idea, and my immediate question about the average dissertation size is how many tokens are needed to represent all of the implicit/unstated knowledge that forms the basis for the dissertation itself. If the dissertation really is the tiny bump in the boundary of human knowledge that Matt Might's 'The illustrated guide to a Ph.D.' [1] shows, then what's the token size for everything up to the bump created by the dissertation?

> Current AI systems, regardless of context window size, don't truly "understand" information the way humans do. They recognise patterns in data and generate outputs based on statistical relationships, but lack the deeper conceptual understanding, intentionality, and the other characteristics of human intelligence.

Whether or not I'm an AI believer, I'm not sure I could genuinely answer the question 'Do _you_ truly understand information?' if someone posed that to me, as I have no real understanding of how to measure that. I want to say it's meta-cognition, my ability to think about thinking and reason about my own knowledge, but that starts to feel pretty fuzzy and I wonder how much of that is anthropocentric thinking.

[1] https://matt.might.net/articles/phd-school-in-pictures/

senordevnyc · 4 months ago
It's pretty amusing that we're now at the stage of AI denialism where the goalposts are "AI is only smart if it can get a PhD in an area it hasn't been trained in!"

Looking forward to seeing where we move the goalposts next. Perhaps AI isn't smart because it can't invent a cure for cancer in 24 hours? Or because it can't challenge our core understanding of the laws of physics?

_jab · 4 months ago
I think the reason the goalposts keep shifting is really that these AI labs are illustrating Goodhart's law in action: https://en.wikipedia.org/wiki/Goodhart%27s_law

We're at a point where AI labs are fairly reliably able to create AI systems that can achieve good performance on given benchmarks, via targeted training efforts. You argue that it's not reasonable to expect that performance to generalize across other domains. But given that these same companies are trumpeting that AGI is around the corner, I think that's a fair expectation for us to have, and AI has disappointed in that regard.

jqpabc123 · 4 months ago
Last time I checked, we are still at the stage of AI apologetics where an NBA player "throwing bricks" gets misinterpreted as an act of vandalism.

In other words, there is limited capacity for real "understanding" --- which McDonalds and IBM discovered after 3 years of trying to get AI to take orders at drive-thru windows.

https://www.cio.com/article/190888/5-famous-analytics-and-ai...

ceejayoz · 4 months ago
I dunno if the title got changed, but it's now "ChatGPT Blows Mapmaking 101"… and these certainly aren't PhD-level mistakes. It's stuff you'd be docked points for in middle school.
wrs · 4 months ago
You clearly didn’t read the article if you think that’s where the goalposts were. The title is an understatement.

How about “ChatGPT can’t even keep its own maps consistent from one prompt to the next, much less get a PhD in geography”?

gojomo · 4 months ago
The giveaway that Marcus is generating slop for consumption by unsophisticated AI-haters is that he doesn't bother to mention what models/options were involved.
garymarcus · 4 months ago
If you think AI is “smart” or “PhD level” or that it “has an IQ of 120”, take five minutes to read my latest newsletter (link below), as I challenge ChatGPT to the incredibly demanding task of drawing a map of major port cities with above-average income.

The results aren’t pretty. 0/5, no two maps alike.

“Smart” means understanding abstract concepts and combining them well, not just retrieving and analogizing in shoddy ways.

No way could a system this wonky actually get a PhD in geography. Or economics. Or much of anything else.

rvz · 4 months ago
> If you think AI is “smart” or “PhD level” or that it “has an IQ of 120”...

It's not there yet, it's still learning™, but a lot of progress in AI has happened recently; I'll give them that.

However, as you already point out in your newsletter, there are also lots of misleading and dubious claims and far too much hype aimed at raising VC capital, along with all the overpromising that comes with it.

One of them is the true meaning of "AGI" (right now it is starting to look like a scam), since there are several conflicting definitions coming directly from those who stand to benefit.

What do you think it truly means given your observations?

enjoylife · 4 months ago
"It's still learning" is a misnomer. The model isn't learning; we are. LLMs are static after training; all improvement comes from human iteration in the outer loop: fine-tuning, prompt engineering, tool integration, retrieval. Until the outer loop itself becomes autonomous and self-improving, we're nowhere near AGI. Current hype confuses capability with agency.
some_random · 4 months ago
This is a really surface-level investigation that happens to exclusively use the part of the current multimodal model that is really bad at the task presented, namely demanding precise images, graphs, charts, etc. Try asking for tables of data or matplotlib code to generate the same visualizations and it will typically do far better. That said, if you actually use even the latest models day to day you'll inevitably run into even stupider mistakes/hallucinations than this, but the point you're trying to make is undermined by the appearance of having picked up ChatGPT solely to write a Substack post dunking on it.
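
For what it's worth, here's a rough sketch of the "ask for a table or plotting code instead of an image" approach; the city names, coordinates, and income values are placeholders I made up, not real data:

~~~
# Made-up port city data, purely to illustrate the matplotlib route.
import matplotlib.pyplot as plt

cities = [
    # (name, latitude, longitude, relative income index -- hypothetical values)
    ("Port A", 51.9, 4.5, 1.4),
    ("Port B", 1.3, 103.8, 1.6),
    ("Port C", 31.2, 121.5, 0.9),
    ("Port D", 34.1, -118.2, 1.3),
]

lons = [c[2] for c in cities]
lats = [c[1] for c in cities]
sizes = [200 * c[3] for c in cities]

plt.scatter(lons, lats, s=sizes)
for name, lat, lon, _ in cities:
    plt.annotate(name, (lon, lat), textcoords="offset points", xytext=(5, 5))

plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Port cities, marker size ~ income index")
plt.show()
~~~
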
ben_w · 4 months ago
I very much appreciate all the ways we're improving our ideas of what "smart" means.

I wouldn't call LLMs "smart" either, but with a different definition than the one you use here: to me, at the moment, "smart" means being able to learn efficiently, with few examples needed to master a new challenge.

This may not be sufficient, but it does avoid any circular arguments about whether a given model has any "understanding" at all.


knowsuchagency · 4 months ago
I don't believe ChatGPT has an IQ of 120, and after reading the linked article, I don't think the author does either.
carlhjerpe · 4 months ago
Not arguing with anything, but the (link below) doesn't exist.
jqpabc123 · 4 months ago
The only thing an LLM does really well is statistical prediction.

As should be expected, sometimes it predicts correctly and sometimes it doesn't.

It's kinda like FSD mode in a Tesla. If you're not willing to bet your life on it (and why would you?), it's really not all that useful.