I'm not specifically referring to this article, but in general I've noticed a frustrating pattern:
> AI company releases generalist model for testing/experimentation
> Users unwisely treat it like a universal oracle and give it tasks far outside its training domain
> It doesn't perform well
> People are shocked and warn about the "dangers of AI"
This happens every time. Why can't we treat AI tools as what they actually are: interesting demonstrations of emergent intelligent properties that are a few versions away from production-ready capabilities?
I spoke with Marvin Minsky once back in the 1990s and he told me that he thought "emergent properties" were bunk.
As for the future, I am certain LLMs will become more efficient in terms of resource consumption and easier to train, but I am not so certain that they're going to fundamentally solve the problems that LLMs have now.
Try to train one to tell you what kind of shoes a person is wearing in an image and it will likely "short circuit" and conclude that a person with fancy clothes is wearing fancy shoes (true much more often than not) even if you can't see their shoes at all. (Is a person wearing a sports jersey and holding a basketball on a basketball court necessarily wearing sneakers?) This is one of those cases where bias looks like it is giving better performance, but show a system like that a basketball player wearing combat boots and it will look stupid. So much of the apparent high performance of LLMs comes out of this bias, and I'm not sure the problem can really be fixed.
There are lots of papers about using some kind of LLM to cut costs and staffing in hospitals. Big companies believe there is a lot of money to be made here, despite the obvious dangers.
A quick Google search found this paper; here's a quote:
"Our evaluation shows that GPT-4V excels in understanding medical images and is able to generate high-quality radiology reports and effectively answer questions about medical images. Meanwhile, it is found that its performance for medical visual grounding needs to be substantially improved. In addition, we observe the discrepancy between the evaluation outcome from quantitative analysis and that from human evaluation. This discrepancy suggests the limitations of conventional metrics in assessing the performance of large language models like GPT-4V and the necessity of developing new metrics for automatic quantitative analysis."
Same reason we can't trust people to drive 35 MPH when the road is straight and wide, no matter how many signs are posted to declare the speed limit. It's just too tempting and easy to become complacent.
That, and these companies have a substantial financial interest in pushing the omniscience/omnipotence narrative. OpenAI trying to encourage responsible AI usage is like Philip Morris trying to encourage responsible tobacco use. Fundamental conflict of interest.
Not in human nature. A friend of mine just asked me how to make ChatGPT create a PowerPoint presentation. He meant the .pptx file. It can't. Googling for him, I learned that of course it can create the text and, a little more surprisingly, even the VBA program that creates the slides. That's out of scope for that friend of mine. He was very surprised. He was like, with all it can do, why not the slides?
Did you look at Office365's Co-Pilot? It can do a lot of stuff, and it can certainly aid in creating a PowerPoint presentation. Technically, Co-Pilot would be using Office365's PowerPoint software to create the slideshow file itself, but it would insert the content to the best of its ability based on the prompts.
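As a rough illustration of the "generate a program that builds the slides" route mentioned above: a minimal sketch in Python using the python-pptx library instead of VBA (the slide titles, text, and file name are made up for the example):

```python
# Minimal sketch: build a .pptx programmatically, the way an LLM-generated
# script would, using python-pptx (pip install python-pptx).
from pptx import Presentation

# Hypothetical slide content; in practice this is the part the LLM writes.
slides = [
    ("Project Kickoff", "Goals, timeline, and owners"),
    ("Next Steps", "Review budget\nSchedule follow-up"),
]

prs = Presentation()
layout = prs.slide_layouts[1]  # built-in "Title and Content" layout

for title, body in slides:
    slide = prs.slides.add_slide(layout)
    slide.shapes.title.text = title
    slide.placeholders[1].text = body  # body placeholder of this layout

prs.save("deck.pptx")  # the file the chat interface itself won't hand you
```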
1. Because you've got one or more of the below spinning it into either a butterfly to chase or a product to buy:
- 'Research Groups', e.g. Gartner
- Startups with an 'AI' product
- Startups that add an 'AI' feature
- OpenAI [0]
2. I'm currently working on a theory that a reasonable portion of the population in certain circles is viewing ChatGPT and its ilk as the perfect way to mask their long COVID symptoms, and is thus embracing it blindly. [1]
[0] - The level of hype in some articles about ChatGPT3 reminded me a little too much of the fake viral news around the launch of Pokemon Go, adjusted for fake viral news producers improving the quality of their tactics. Especially because it flares up when -they- do things... but not when others do?
[1] - Whoever needs to read this probably won't, but: I know when you had ChatGPT write the JIRA requirements, and more importantly I know when you didn't sanity-check what it spit out.
Every piece of marketing coming out of Google and Microsoft is about how AI is coming and it's the future, and yet there are still people asking why others have unrealistic expectations for these models.
There are production-ready AI tools in hospitals, with FDA approval. The writers of this article just decided to try a non-FDA-approved tool and ignore the approved ones.
Because none of us really know what we mean by "intelligent".
When we see a thing writing coherently about any subject we're not experts in, even when we have noticed huge and dramatic "wet pavements cause rain"-level flaws in its writing about our own speciality, we forget all those examples of flaws the moment the page changes and revert once more to thinking it is a fount of wisdom.
We'd been doing this with newspapers for a century or two before Michael Crichton coined the term "Gell-Mann amnesia effect".
Because the average user of AI is not a Hacker News user who understands its limitations, and the ones who do understand its limitations tend to exaggerate and overhype it to make people think it can do anything. The only real fix is for companies like OpenAI to encourage better usage (e.g. tutorials), but there's no incentive for them to do so yet.
Because everybody is shit-scared of the implications of it being good enough to replace them. This is just a protective defense mechanism, because it threatens the status quo upon which entire careers have been built.
"It is difficult to get a man to understand something when his salary depends on his not understanding it."
Last week, a tweet from a Claude user fighting with radiologists went viral, claiming the LLM had found a tumor when the radiologists did not (most of the replies were rightfully dunking on it):
> A friend sent me MRI brain scan results and I put it through Claude.
> No other AI would provide a diagnosis, Claude did.
> Claude found an aggressive tumour.
> The radiologist report came back clean.
> I annoyed the radiologists until they re-checked. They did so with 3 radiologists and their own AI. Came back clean, so looks like Claude was wrong.
> But looks how convincing Claude sounds! We're still early...
https://twitter.com/misha_saul/status/1771019329737462232
This is super dangerous. It's WebMD all over again, but much worse. Diagnosis is hard enough already, but now you have to fight against some phantom model that the patient thinks is smarter than their doctor.
There is probably a disproportionate amount of text describing the presence of a tumor on a scan compared to text describing what a normal scan looks like, so this is expected. But I think that is not the purpose of an LLM. There should be a classification algorithm detecting whether or not a tumor is present, and then an LLM can be used to generate a description of it, or of why the scan is normal.
And I don't know any specifically, but there are probably already tools to help radiologists, which is what they call their "own AI".
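A minimal sketch of that split, purely to show the shape of the pipeline; the detector and the report writer below are dummy stand-ins, not real medical models:

```python
# Sketch: classification first, text generation second. Both steps are
# illustrative placeholders, not real diagnostic components.
import numpy as np

def tumor_classifier(scan: np.ndarray) -> tuple[bool, float]:
    # Stand-in for a dedicated trained detector (e.g. a CNN); dummy rule here.
    score = float(scan.mean())
    return score > 0.5, score

def write_report(finding: bool, score: float) -> str:
    # Stand-in for the LLM step: it only verbalizes what the detector decided,
    # it never decides on its own whether a tumor is present.
    if finding:
        return f"Detector flagged a suspicious region (score {score:.2f}); needs radiologist review."
    return f"Detector found no suspicious region (score {score:.2f})."

scan = np.zeros((256, 256))  # stand-in for an MRI slice
print(write_report(*tumor_classifier(scan)))
```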
After reading the diagnosis I caught myself wanting to examine the MRI to see if the bright area really exists, which means I fell for this too. Imagine being the person who received this diagnosis. Of course you're going to be concerned, even if it is LLM garbage.
Wish we'd get more articles from actual practitioners using generative AI to do things. Nearly all the articles you see on the subject are on the level of existential threats or press releases, or dunking on mistakes made by LLMs. I'd really rather hear a detailed writeup from professionals who used generative AI to accomplish something. The only such article I've run across in the wild is this one [0] from JetBrains. Anyway, if anyone has any article suggestions like this, please share!
[0] https://blog.jetbrains.com/blog/2023/10/16/ai-graphics-at-je...
This is expected, right? It would be surprising if something as general as GPT-4V were trained on a diverse and nuanced set of radiology images, versus, say, a traditional CNN trained and refined to detect specific diseases. It feels akin to concluding that a picnic basket doesn't make a good fishing toolbox after all. Worse would be if someone in power were actually enthusiastically recommending plain GPT-4V as a realistic solution for specialized vision tasks.
The ECG stood out to me because my wife is a cardiologist and worked with a company, iCardiac, to look for specific anomalies in ECGs. They were looking for long QT to ensure clinical trials didn't mess with the heart. There was a team of data scientists that worked to help automate this, and they couldn't, so they just augmented the UI for the experts; there was always a person in the loop.
To a layperson looking at an ECG, it's a problem that seems easy if you know about some tools in your math toolbox, but it's deceptively hard, and a false negative might mean death for the patient. So I'm not going to trust a generic vision transformer with this task, and until I see overwhelming evidence I won't trust a specifically trained model with it either.
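To illustrate why it seems easy from the math-toolbox point of view: the naive version really is just peak finding and interval arithmetic, as in the toy sketch below (synthetic signal, made-up thresholds). It is exactly this kind of shortcut that falls apart on real, noisy ECGs:

```python
# Naive illustration of why ECG analysis *looks* easy: find the R peaks,
# then measure the intervals. Real ECGs (noise, ectopy, shifting baselines,
# varying morphology) break assumptions like the fixed threshold used here.
import numpy as np
from scipy.signal import find_peaks

fs = 250                                  # samples per second (toy value)
t = np.arange(0, 10, 1 / fs)
beat_hz = 1.2
# Synthetic "ECG": a sharp spike at each beat plus a little noise.
ecg = np.exp(-((t % (1 / beat_hz)) * 40) ** 2) + 0.02 * np.random.randn(t.size)

r_peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))
rr = np.diff(r_peaks) / fs                # seconds between consecutive beats
print(f"Detected {r_peaks.size} beats, mean RR interval {rr.mean():.2f} s")
```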
Have there been many studies like this one that have been judged blind?
I'd trust a study like this a little more if the human evaluators were presented with the output of GPT-4 mixed together with the output from human experts, such that they didn't know if the explanation they were evaluating came from a human or an LLM.
This would reduce the risk that participants in the study, consciously or subconsciously, marked down AI results because they knew them to be AI results.
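Mechanically the blinding is simple, something like the sketch below (the item texts and source labels are placeholders): raters only ever see the shuffled, relabeled items, and the human/LLM origin is joined back in after scoring.

```python
# Sketch of a blinded evaluation: raters score shuffled items under opaque
# IDs; the human-vs-LLM source label is only re-attached afterwards.
import random

items = [
    {"source": "human", "text": "Explanation A ..."},  # placeholder texts
    {"source": "gpt4",  "text": "Explanation B ..."},
    {"source": "human", "text": "Explanation C ..."},
]

key = {}       # hidden mapping from blinded ID to source, kept from raters
blinded = []
for i, item in enumerate(random.sample(items, len(items))):
    blinded_id = f"item-{i:03d}"
    key[blinded_id] = item["source"]
    blinded.append({"id": blinded_id, "text": item["text"]})

# Raters fill in scores for `blinded` without ever seeing `key`;
# only after scoring do you unblind and compare human vs LLM results.
scores = {entry["id"]: None for entry in blinded}
```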
Soon enough we will train models with the firepower of GPT-4 (or 5?) that are purpose-built and trained from the ground up to be medical diagnostic tools: a hard focus on medicine, with thousands of hours of licensed-physician RLHF. It will happen, and it is almost certainly already underway.
But until it comes to fruition, I think it's largely a waste for people to spend time studying the viability of general models for medical tasks.
Definitely. Saw a video recently mentioning the increase in well-paid gigs for therapists in a metro area, which ask only that all therapist-patient interactions be recorded and treated as IP. It seems likely that the data would become part of a corpus to train specialist models for psychotherapy AI, and if this kind of product can actually work, I don't see why every other analytical profession wouldn't be targeted too, with work already well underway. Lots of guesses there though, and personally I hope we aren't rushing into this.
I question the competence of anyone using any modern AI (not just GPTs) for medical decisions. Beyond being capable of passing a multiple-choice exam (which can be done by trained monkeys), they're not ready for this, and they won't be for years. I guess confirmation of this is still good to know.
He said the same thing about perceptrons in general. When it comes to bunk, Minsky was... let's just say he was a subject matter expert.
He caused a lot of people to waste a lot of time. I've got your XOR right here, Marvin...
For the shoe example: just add another model that first checks whether any shoes are visible at all.
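A rough sketch of that gating idea, with hypothetical stand-in predictors (neither is a real trained model), just to show the structure:

```python
# Sketch of the two-stage idea: only ask "what kind of shoes?" once a first
# model has confirmed shoes are actually visible. Both predictors are dummies.
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float

def shoes_visible(image) -> Prediction:
    # Stand-in for a binary "are shoes visible?" classifier.
    return Prediction("visible", 0.12)

def shoe_type(image) -> Prediction:
    # Stand-in for the attribute classifier that currently "short circuits".
    return Prediction("sneakers", 0.91)

def describe_shoes(image) -> str:
    gate = shoes_visible(image)
    if gate.label != "visible" or gate.confidence < 0.5:
        return "shoes not visible"  # refuse to guess from the rest of the outfit
    return shoe_type(image).label

print(describe_shoes(image=None))  # -> "shoes not visible"
```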
https://arxiv.org/html/2310.20381v5
There's certainly some waffling at the end of that paper, but many people will come away with the feeling that this is something GPT-4V can do.
If an AI is able to record an operation with 9x% accuracy but you save on human staff, the insurers might just accept this.
Nonetheless, the chance that AI will keep getting better and better at it is very high.
Our society will switch: instead of rewriting or updating software, we will fine-tune models and add more examples.
Because this is actually sustainable (you can reuse the old data), this approach will win in the end.
The only thing changing in the future will be the model architecture; training data will only ever be added to.
I wrote a rant a few months back about how the greatest threat to generative AI is people using it poorly: https://minimaxir.com/2023/10/ai-sturgeons-law/
"It is difficult to get a man to understand something when his salary depends on his not understanding it."
> A friend sent me MRI brain scan results and I put it through Claude.
> No other AI would provide a diagnosis, Claude did.
> Claude found an aggressive tumour.
> The radiologist report came back clean.
> I annoyed the radiologists until they re-checked. They did so with 3 radiologists and their own AI. Came back clean, so looks like Claude was wrong.
> But looks how convincing Claude sounds! We're still early...
https://twitter.com/misha_saul/status/1771019329737462232
For whatever reason, Hugo's documentation is hard to get into, while ChatGPT is shockingly good at telling me what I'm actually looking for.