I'm not specifically referring to this article, but in general I've noticed a frustrating pattern:
> AI company releases generalist model for testing/experimentation
> Users unwisely treat it like a universal oracle and give it tasks far outside its training domain
> It doesn't perform well
> People are shocked and warn about the "dangers of AI"
This happens every time. Why can't we treat AI tools as what they actually are: interesting demonstrations of emergent intelligent properties that are a few versions away from production-ready capabilities?
I spoke with Marvin Minsky once back in the 1990s and he told me that he thought "emergent properties" were bunk.
As for the future, I am certain LLMs will become more efficient in terms of resource consumption and easier to train, but I am not so certain that they're going to fundamentally solve the problems that LLMs have now.
Try to train one to tell you what kind of shoes a person is wearing in an image and it will likely "short circuit" and conclude that a person with fancy clothes is wearing fancy shoes (true much more often than not) even if you can't see their shoes at all. (Is a person wearing a sports jersey and holding a basketball on a basketball court necessarily wearing sneakers?) This is one of those cases where bias looks like it is giving better performance, but show a system like that a basketball player wearing combat boots and it will look stupid. So much of the apparent high performance of LLMs comes out of this bias, and I'm not sure the problem can really be fixed.
There are lots of papers about using some kind of LLM to cut costs and staffing in hospitals. Big companies believe there is a lot of money to be made here, despite the obvious dangers.
A quick Google search found this paper; here's a quote:
"Our evaluation shows that GPT-4V excels in understanding medical images and is able to generate high-quality radiology reports and effectively answer questions about medical images. Meanwhile, it is found that its performance for medical visual grounding needs to be substantially improved. In addition, we observe the discrepancy between the evaluation outcome from quantitative analysis and that from human evaluation. This discrepancy suggests the limitations of conventional metrics in assessing the performance of large language models like GPT-4V and the necessity of developing new metrics for automatic quantitative analysis."
Same reason we can't trust people to drive 35 MPH when the road is straight and wide, no matter how many signs are posted to declare the speed limit. It's just too tempting and easy to become complacent.
That, and these companies have a substantial financial interest in pushing the omniscience/omnipotence narrative. OpenAI trying to encourage responsible AI usage is like Philip Morris trying to encourage responsible tobacco use. Fundamental conflict of interest.
Not in human nature. A friend of mine just asked me how to make ChatGPT create a PowerPoint presentation. He meant the .pptx file. It can't. Googling for him, I learned that of course it can create the text and, a little more surprisingly, even the VBA program that creates the slides. That's out of scope for that friend of mine. He was very surprised. He was like, with all it can do, why not the slides?
Did you look at Office365's Co-Pilot? It can do a lot of stuff, and it can certainly aid in creating a PowerPoint presentation. Technically, Co-Pilot would be using Office365's PowerPoint software to create the slideshow file itself, but it would insert the content to the best of its ability based on the prompts.
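As a rough illustration of the "generate a program that builds the slides" route mentioned above: a minimal sketch in Python using the python-pptx library instead of VBA (the slide titles, text, and file name are made up for the example):

```python
# Minimal sketch: build a .pptx programmatically, the way an LLM-generated
# script would, using python-pptx (pip install python-pptx).
from pptx import Presentation

# Hypothetical slide content; in practice this is the part the LLM writes.
slides = [
    ("Project Kickoff", "Goals, timeline, and owners"),
    ("Next Steps", "Review budget\nSchedule follow-up"),
]

prs = Presentation()
layout = prs.slide_layouts[1]  # built-in "Title and Content" layout

for title, body in slides:
    slide = prs.slides.add_slide(layout)
    slide.shapes.title.text = title
    slide.placeholders[1].text = body  # body placeholder of this layout

prs.save("deck.pptx")  # the file the chat interface itself won't hand you
```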
1. Because you've got one or more of the below spinning it into either a butterfly to chase or a product to buy:
- 'Research Groups', e.g. Gartner
- Startups with an 'AI' product
- Startups that add an 'AI' feature
- OpenAI [0]
2. I'm currently working on a theory that a reasonable portion of the population in certain circles is viewing ChatGPT and its ilk as the perfect way to mask their long COVID symptoms, and is thus embracing it blindly. [1]
[0] - The level of hype in some articles about ChatGPT3 reminded me a little too much of the fake viral news around the launch of Pokemon Go, adjusted for fake viral news producers improving the quality of their tactics. Especially because it flares up when -they- do things... but not when others do?
[1] - Whoever needs to read this probably won't, but: I know when you had ChatGPT write the JIRA requirements, and more importantly I know when you didn't sanity-check what it spit out.
Every piece of marketing coming out of Google and Microsoft is about how AI is coming and it's the future, and yet there are still people asking why others have unrealistic expectations for these models.
There are production-ready AI tools in hospitals, with FDA approval. The writers of this article just decided to try a non-FDA-approved tool and ignore the approved ones.
Because none of us really know what we mean by "intelligent".
When we see a thing writing coherently about any subject we're not experts in, even when we have noticed huge and dramatic "wet pavements cause rain"-level flaws in its writing about our own speciality, we forget all those examples of flaws the moment the page changes and revert once more to thinking it is a fount of wisdom.
We'd been doing this with newspapers for a century or two before Michael Crichton coined the term "Gell-Mann amnesia effect".
Because the average user of AI is not a Hacker News user who understands its limitations, and the ones who do understand its limitations tend to exaggerate and overhype it to make people think it can do anything. The only real fix is for companies like OpenAI to encourage better usage (e.g. tutorials), but there's no incentive for them to do so yet.
Because everybody is shit-scared of the implications of it being good enough to replace them. This is just a protective defense mechanism, because it threatens the status quo upon which entire careers have been built.
"It is difficult to get a man to understand something when his salary depends on his not understanding it."
Last week, a tweet from a Claude user fighting with radiologists went viral, claiming the LLM had found a tumor when the radiologists did not (most of the replies were rightfully dunking on it):
> A friend sent me MRI brain scan results and I put it through Claude.
> No other AI would provide a diagnosis, Claude did.
> Claude found an aggressive tumour.
> The radiologist report came back clean.
> I annoyed the radiologists until they re-checked. They did so with 3 radiologists and their own AI. Came back clean, so looks like Claude was wrong.
> But looks how convincing Claude sounds! We're still early...
https://twitter.com/misha_saul/status/1771019329737462232
This is super dangerous. It's WebMD all over again, but much worse. Diagnosis is hard enough already, but now you have to fight against some phantom model that the patient thinks is smarter than their doctor.
There is probably a disproportionate amount of text describing the presence of a tumor on a scan compared to text describing what a normal scan looks like, so this is expected. But I think that is not the purpose of an LLM. There should be a classification algorithm detecting whether or not a tumor is present, and then an LLM can be used to generate a description of it, or of why the scan is normal.
And I don't know any specifically, but there are probably already tools to help radiologists, which is what they call their "own AI".
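A minimal sketch of that split, purely to show the shape of the pipeline; the detector and the report writer below are dummy stand-ins, not real medical models:

```python
# Sketch: classification first, text generation second. Both steps are
# illustrative placeholders, not real diagnostic components.
import numpy as np

def tumor_classifier(scan: np.ndarray) -> tuple[bool, float]:
    # Stand-in for a dedicated trained detector (e.g. a CNN); dummy rule here.
    score = float(scan.mean())
    return score > 0.5, score

def write_report(finding: bool, score: float) -> str:
    # Stand-in for the LLM step: it only verbalizes what the detector decided,
    # it never decides on its own whether a tumor is present.
    if finding:
        return f"Detector flagged a suspicious region (score {score:.2f}); needs radiologist review."
    return f"Detector found no suspicious region (score {score:.2f})."

scan = np.zeros((256, 256))  # stand-in for an MRI slice
print(write_report(*tumor_classifier(scan)))
```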
After reading the diagnosis I caught myself wanting to examine the MRI to see if the bright area really exists, which means I fell for this too. Imagine being the person who received this diagnosis. Of course you're going to be concerned, even if it is LLM garbage.
Wish we'd get more articles from actual practitioners using generative AI to do things. Nearly all the articles you see on the subject are on the level of existential threats or press releases, or dunking on mistakes made by LLMs. I'd really rather hear a detailed writeup from professionals who used generative AI to accomplish something. The only such article I've run across in the wild is this one [0] from JetBrains. Anyway, if anyone has any article suggestions like this, please share!
[0] https://blog.jetbrains.com/blog/2023/10/16/ai-graphics-at-je...
This is expected, right? It would be surprising if something as general as GPT-4V were trained on a diverse and nuanced set of radiology images, versus, say, a traditional CNN trained and refined to detect specific diseases. It feels akin to concluding that a picnic basket doesn't make a good fishing toolbox after all. Worse would be if someone in power were actually enthusiastically recommending plain GPT-4V as a realistic solution for specialized vision tasks.
The ECG stood out to me because my wife is a cardiologist and worked with a company, iCardiac, to look for specific anomalies in ECGs. They were looking for long QT to ensure clinical trials didn't mess with the heart. There was a team of data scientists that worked to help automate this, and they couldn't, so they just augmented the UI for the experts; there was always a person in the loop.
To a layperson looking at an ECG, it's a problem that seems easy if you know about some tools in your math toolbox, but it's deceptively hard, and a false negative might mean death for the patient. So I'm not going to trust a generic vision transformer with this task, and until I see overwhelming evidence I won't trust a specifically trained model with it either.
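To illustrate why it seems easy from the math-toolbox point of view: the naive version really is just peak finding and interval arithmetic, as in the toy sketch below (synthetic signal, made-up thresholds). It is exactly this kind of shortcut that falls apart on real, noisy ECGs:

```python
# Naive illustration of why ECG analysis *looks* easy: find the R peaks,
# then measure the intervals. Real ECGs (noise, ectopy, shifting baselines,
# varying morphology) break assumptions like the fixed threshold used here.
import numpy as np
from scipy.signal import find_peaks

fs = 250                                  # samples per second (toy value)
t = np.arange(0, 10, 1 / fs)
beat_hz = 1.2
# Synthetic "ECG": a sharp spike at each beat plus a little noise.
ecg = np.exp(-((t % (1 / beat_hz)) * 40) ** 2) + 0.02 * np.random.randn(t.size)

r_peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))
rr = np.diff(r_peaks) / fs                # seconds between consecutive beats
print(f"Detected {r_peaks.size} beats, mean RR interval {rr.mean():.2f} s")
```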
Have there been many studies like this one that have been judged blind?
I'd trust a study like this a little more if the human evaluators were presented with the output of GPT-4 mixed together with the output from human experts, such that they didn't know if the explanation they were evaluating came from a human or an LLM.
This would reduce the risk that participants in the study, consciously or subconsciously, marked down AI results because they knew them to be AI results.
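Mechanically the blinding is simple, something like the sketch below (the item texts and source labels are placeholders): raters only ever see the shuffled, relabeled items, and the human/LLM origin is joined back in after scoring.

```python
# Sketch of a blinded evaluation: raters score shuffled items under opaque
# IDs; the human-vs-LLM source label is only re-attached afterwards.
import random

items = [
    {"source": "human", "text": "Explanation A ..."},  # placeholder texts
    {"source": "gpt4",  "text": "Explanation B ..."},
    {"source": "human", "text": "Explanation C ..."},
]

key = {}       # hidden mapping from blinded ID to source, kept from raters
blinded = []
for i, item in enumerate(random.sample(items, len(items))):
    blinded_id = f"item-{i:03d}"
    key[blinded_id] = item["source"]
    blinded.append({"id": blinded_id, "text": item["text"]})

# Raters fill in scores for `blinded` without ever seeing `key`;
# only after scoring do you unblind and compare human vs LLM results.
scores = {entry["id"]: None for entry in blinded}
```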
Soon enough we will train models with the firepower of GPT-4 (or 5?) that are purpose-built and trained from the ground up to be medical diagnostic tools: a hard focus on medicine, with thousands of hours of licensed-physician RLHF. It will happen, and it is almost certainly already underway.
But until it comes to fruition, I think it's largely a waste for people to spend time studying the viability of general models for medical tasks.
Definitely. Saw a video recently mentioning the increase in well-paid gigs for therapists in a metro area, which ask only that all therapist-patient interactions be recorded and treated as IP. It seems likely that the data would become part of a corpus to train specialist models for psychotherapy AI, and if this kind of product can actually work, I don't see why every other analytical profession wouldn't be targeted too, with work already well underway. Lots of guesses there though, and personally I hope we aren't rushing into this.
I question the competence of anyone using any modern AI (not just GPTs) for medical decisions. Beyond being capable of passing a multiple-choice exam (which can be done by trained monkeys), they're not ready for this, and they won't be for years. I guess confirmation of this is still good to know.
He said the same thing about perceptrons in general. When it comes to bunk, Minsky was... let's just say he was a subject matter expert.
He caused a lot of people to waste a lot of time. I've got your XOR right here, Marvin...
For the shoe example: just add another model that first checks whether any shoes are visible at all.
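A rough sketch of that gating idea, with hypothetical stand-in predictors (neither is a real trained model), just to show the structure:

```python
# Sketch of the two-stage idea: only ask "what kind of shoes?" once a first
# model has confirmed shoes are actually visible. Both predictors are dummies.
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float

def shoes_visible(image) -> Prediction:
    # Stand-in for a binary "are shoes visible?" classifier.
    return Prediction("visible", 0.12)

def shoe_type(image) -> Prediction:
    # Stand-in for the attribute classifier that currently "short circuits".
    return Prediction("sneakers", 0.91)

def describe_shoes(image) -> str:
    gate = shoes_visible(image)
    if gate.label != "visible" or gate.confidence < 0.5:
        return "shoes not visible"  # refuse to guess from the rest of the outfit
    return shoe_type(image).label

print(describe_shoes(image=None))  # -> "shoes not visible"
```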
https://arxiv.org/html/2310.20381v5
There's certainly some waffling at the end of that paper, but many people will come away with the feeling that this is something GPT-4V can do.
If an AI is able to record an operation with 9x% accuracy but you save on human staff, the insurers might just accept this.
Nonetheless, the chance that AI will keep getting better and better at it is very high.
Our society will switch: instead of rewriting or updating software, we will fine-tune models and add more examples.
Because this is actually sustainable (you can reuse the old data), this approach will win in the end.
The only thing changing in the future will be the model architecture; training data will only ever be added to.
I wrote a rant a few months back about how the greatest threat to generative AI is people using it poorly: https://minimaxir.com/2023/10/ai-sturgeons-law/
"It is difficult to get a man to understand something when his salary depends on his not understanding it."
> A friend sent me MRI brain scan results and I put it through Claude.
> No other AI would provide a diagnosis, Claude did.
> Claude found an aggressive tumour.
> The radiologist report came back clean.
> I annoyed the radiologists until they re-checked. They did so with 3 radiologists and their own AI. Came back clean, so looks like Claude was wrong.
> But looks how convincing Claude sounds! We're still early...
https://twitter.com/misha_saul/status/1771019329737462232
For whatever reason, Hugo's documentation is hard to get into, while ChatGPT is shockingly good at telling me what I'm actually looking for.