"For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported and nearly half of its responses are not fully supported"
Show us the source code and data. The way the RAG system is implemented is responsible for that score.
Building a RAG system that provides good citations on top of GPT-4 is difficult (and I would say not a fully solved problem at this point) but those implementation details still really matter for this kind of study.
So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Importantly this doesn't actually guarantee that it does any kind of search.
I'm confused as to whether they're using the API or not. AFAIK only the Assistants API has access to web search, so I would expect this was done manually? But then the reason for only doing this with OpenAI is that the others don't provide an API.
> GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing. Other RAG models such as Perplexity.AI or Bard are currently unavailable for evaluation due to a lack of API access with sources, as well as restrictions on the ability to download their web results. For example, while pplx-70b-online produces results with online access, it does not return the actual URLs used in those results. Gemini Pro is available as an API, but Bard’s implementation of the model with RAG is unavailable via API.
> Importantly this doesn't actually guarantee that it does any kind of search.
What's more important is that a user _can see_ whether GPT-4 has searched for something or not, and can ask it to actually search the web for references.
I saw a presentation about this last week at the Generative AI Paris meetup, by the team building the next generation of https://vidal.fr/, the reference for medical data in French-speaking countries. It used to be a paper dictionary and has existed since 1914.
They focus on the more specific problem of preventing drug misuse (checking interactions with other drugs, diseases, pathologies, etc.). They use GPT-4 + RAG with Qdrant and return the exact source of the information, highlighted in the data. They are expanding their test set - they use real questions asked by GPs - but currently they have a 0% error rate (and fewer than 20% of cases where the model cannot answer).
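As a rough illustration of what "return the exact source of the information" can look like mechanically (not VIDAL's actual implementation - the collection name, payload fields and dummy vectors here are invented), the trick is to store the source reference alongside each chunk in the Qdrant payload:

    # Sketch: keep the exact passage and its source in the Qdrant payload, so the
    # answer shown to the GP can be attributed and highlighted verbatim.
    # Assumptions: qdrant-client is installed; the 3-dim vectors are dummies and a
    # real system would embed the text with an embedding model.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    client = QdrantClient(":memory:")  # in-process instance, just for the example
    client.recreate_collection(
        collection_name="drug_monographs",
        vectors_config=VectorParams(size=3, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="drug_monographs",
        points=[
            PointStruct(id=1, vector=[0.9, 0.1, 0.0],
                        payload={"text": "Drug A is contraindicated with drug B (QT prolongation).",
                                 "source": "Monograph A, section 4.5"}),
            PointStruct(id=2, vector=[0.1, 0.9, 0.0],
                        payload={"text": "Reduce the dose of drug A in renal impairment.",
                                 "source": "Monograph A, section 4.2"}),
        ],
    )

    hits = client.search(collection_name="drug_monographs", query_vector=[0.85, 0.15, 0.0], limit=1)
    for hit in hits:
        print(hit.payload["source"], "->", hit.payload["text"])

The retrieved payload, not the model's memory, is what gets quoted back to the user - which is why the error rate depends so heavily on how well those base documents are curated.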
> So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Man, I am so disappointed. This is not a good study. Come on.
A close friend of mine went down the ChatGPT medical rabbit hole last year. He was disagreeing with his doctors and getting deep into supplements and unproven medical theories.
ChatGPT was instrumental in convincing him that he was correct and his doctors were wrong. He would post his ChatGPT conversations as proof, but we could see that his prompts were becoming obvious leading questions.
He would phrase things like “Is it possible for {symptoms} to be caused by {condition} and could it be treated by {herbal product}?” Then ChatGPT would give him a wall of text saying that it’s possible, which he took as confirmation of being correct.
It was scary to see how much ChatGPT turned into a mirror for what he wanted to be told. He got very good at getting the answers he wanted. He could dismiss answers that disagreed as being hallucinations or being artifacts of an overly protective system. With enough repeat questioning and adjusting his prompts he could get it to say almost whatever he wanted to hear.
ChatGPT is rocket fuel for hypochondriacs. Makes WebMD look tame.
Not quite. People outside of tech can hear "AI" and think it must be right. And they hear how great GPT is, so it must be right. There are no other opinions.
Distilled down, classic search is many results; gpt is one result.
It's even worse than that imho: unscrupulous sellers will go into Keyword Planner in Google Ads, find "under-served" niches, and create content to sell supplements.
One of the most interesting things for me over the past 18 months was seeing the difference between this behavior in GPT-3.5 and 4 (especially the early versions of it).
It jumped from being extremely sensitive to leading questions to being almost ornery in its stubbornness and disagreement. That was one of the big early indicators to me that significant differences were occurring between the two.
I really do wonder just how harmful the disconnect has been between people hearing accurate praise about GPT-4 powered agents and then interacting with GPT-3 level agents and assuming the same competencies.
GPT-3 was a very convincing text generator with lackluster reasoning capabilities, but 4 has significant reasoning capabilities and is less prone to parroting. But the majority of users are likely interacting with 3 because it's free.
It's been very hard to explain to lay audiences the vast difference between two similarly branded products.
While I agree the difference is vast, GPT-4 still goes down the garden path a lot. If you're critical you can usually spot its reasoning errors and challenge them, but it still has enough issues that, conversely, I think in some areas there's a risk it'll be better at convincing people of the wrong things.
Playing with the smaller free models, I’ve definitely found that they like to play along and mirror whatever you say. They’re great as a sounding board and active listener when you need to work through something, and they’re good for recalling facts that they know. They can be really useful when they glue together something you’re talking about and something relevant that they know. They’re terrible for anything that requires interoception or any kind of confidence estimation. They have no idea whether they’re recalling or confabulating.
It's too bad that this type of data beyond the medical record is not tracked alongside it. Although many anecdotal associations are probably not correct for the broad population, there are likely nuances in the patient's history and lifestyle which do make some more basic interventions helpful. ChatGPT surely picks up on some trends but doesn't have enough data about the people for whom those trends are effective.
And of course its responses are highly subject to the prompts used
I tried using your template for a few items and none gave misleading results. Here’s one example:
Prompt:
Can loss of vision be caused by a meniscus tear and treated with rose hips?
Response:
A meniscus tear is a knee injury affecting the cartilage that acts as a cushion between the thigh bone (femur) and shin bone (tibia). This type of injury typically affects mobility and stability of the knee, causing pain, swelling, and difficulty in moving the knee joint. There is no direct physiological pathway through which a meniscus tear could cause loss of vision. Vision problems are generally related to issues within the eye or neurological conditions affecting the optic nerves or brain, not injuries to the knee or other joints.
Regarding the treatment with rose hips for a meniscus tear, while rose hips are known for their high vitamin C content and anti-inflammatory properties, they are not a recognized treatment for repairing a meniscus tear. They may potentially contribute to overall joint health and reduce inflammation, which could be beneficial in managing symptoms associated with joint injuries or conditions like arthritis. However, the primary treatments for a meniscus tear usually involve rest, ice, compression, elevation (RICE protocol), physical therapy, and in some cases, surgery, depending on the severity of the tear.
If you are considering using rose hips or any other supplements for joint health or recovery from an injury, it's important to discuss this with your healthcare provider to ensure it's safe and appropriate for your specific situation. They can offer guidance on effective treatments for your condition and whether any supplements could support your recovery.
This seems like a false comparison because you are bringing up entirely different parts of the body. Most dangerous misunderstandings start from a grain of plausible truth. For example, many people will choose a category of food/ingredient that causes every ailment known to man, and then cut out that ingredient from their diet. They will subsequently claim that skipping that ingredient changes their life, when in all likelihood the simple act of paying attention to where their food comes from and what went into it changes their consumption and behaviors in significant ways that have nothing to do with the bogeyman ingredient. They will then find a study that pumped 100g of said ingredient into a 200g mouse and point out all the negative effects that had which they are now avoiding.
No, regular web search will just give you links. You have to click and go read the text. Putting together a comprehensive page of (seemingly) cogent arguments will need GenAI.
I think the big difference is that with web search there will be a couple of reputable sources at or near the top, like WebMD and the Mayo Clinic. I can search, click one of those, and be fairly sure it'll be accurate. There is no immediate way with ChatGPT for me to know whether it is in reliable or crank mode.
> He would phrase things like “Is it possible for {symptoms} to be caused by {condition} and could it be treated by {herbal product}?” Then ChatGPT would give him a wall of text saying that it’s possible, which he took as confirmation of being correct.
If you mine PubMed, you sometimes find literature with opposite claims. So if an LLM is trained on PubMed/PMC, it will repeat that kind of garbage.
You don't have to dig into "herbal products"; it also happens in more conventional domains.
I once even found that kind of contradiction in articles where the "main author" was the same in both publications. It was about ALS and the "main author" was a prominent scientist from the USA who probably never wrote nor even read any of these articles.
One of the most basic prompts everyone should be using with questions like this is something like "provide detailed arguments both for and against the presented hypothesis."
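A trivial sketch of that idea (the exact wording here is just an example) is to never send the raw, possibly leading question, and instead wrap it:

    def balanced_prompt(hypothesis):
        # Wrap a user's (possibly leading) question so the model argues both sides.
        return (
            f"Hypothesis: {hypothesis}\n"
            "Provide detailed arguments both for and against this hypothesis, "
            "say what evidence would be needed to settle it, "
            "and finish with the mainstream medical view."
        )

    print(balanced_prompt("my fatigue is caused by condition X and treatable with herb Y"))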
This works when the person evaluating the output can distinguish whether the subject matter is correct or not.
But the models can generate compelling devil's advocate positions, particularly when they confabulate supporting facts, which might appear convincing to non-experts.
ChatGPT is a continuator; of course it will mirror input. But I'm sure that someone is training the adversarial persuasion machine that will fund the next internet with ads.
Agreed that this is a challenging problem, but mostly because of the data used to train and the tokenization for language models. We're working on this, building what we call a large medical model (LMM), which is trained on medical event histories from tens of millions of patients, not papers or internet text. Our tokens are medical codes, facilities, demographics, etc. If anyone is interested, we have an API + we're hiring. https://genhealth.ai
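To make "our tokens are medical codes, facilities, demographics" concrete, here is a purely illustrative sketch of turning an event history into a token sequence - not genhealth.ai's actual scheme; the vocabulary and fields are invented:

    # Illustrative only: a patient history tokenized as codes/demographics, not words.
    from dataclasses import dataclass

    @dataclass
    class Event:
        code: str      # e.g. an ICD-10 code
        setting: str   # e.g. "outpatient", "inpatient"

    def tokenize_history(age_band, sex, events):
        tokens = [f"AGE_{age_band}", f"SEX_{sex}"]
        for e in events:
            tokens += [f"SETTING_{e.setting.upper()}", f"CODE_{e.code}"]
        return tokens

    history = [Event("E11.9", "outpatient"),   # type 2 diabetes without complications
               Event("I10", "outpatient")]     # essential hypertension
    print(tokenize_history("60-69", "F", history))
    # ['AGE_60-69', 'SEX_F', 'SETTING_OUTPATIENT', 'CODE_E11.9', 'SETTING_OUTPATIENT', 'CODE_I10']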
This is fascinating... but just a superficial nit about the website and what it puts across: as someone who spent half my life designing, illustrating and art directing for brands, the choice to use anime-style art as the theme for section headers is the exact type of decision I look for when reviewing portfolios. It tells me a potential hire might know how to use image editing tools or be proficient in technical fields, but is not actually a designer; i.e. they can create an image they "like" but lack the knowledge and creative ability to synthesize subtexts in design and art history in original ways that are visually compelling and also create the right tone for a client. The choice of generated anime art implies its use on the page as mere decoration, rather than as a language of communication carefully selected to give viewers the right impression of the company. It argues for why diffusion models guided by engineers will not replace professional art direction anytime soon.
No offense, this is a standard art crit I would give to any art student, or to anyone running a startup who had nonprofessional design.
Ha! We're just trying to use the art to stand out from the rest of health tech, which is quite boring. I get that the target audience is not often into anime (I've watched maybe 10 anime anythings in my life); nevertheless it looks cooler than the "lobby people" on other websites, and I like to think, because of threads like this, that the selection of art has served its purpose.
To play devil's advocate, it could turn out that only 1 person needed to be persuaded by the art, some investor that ends up leading their next round, who happens to be an anime fan and now it was the right choice.
> rather than as a language of communication carefully selected to give viewers the right impression of the company
You could argue it gives the impression that the culture of the company is primarily technical, given that the technical and anime communities have a huge overlap.
> Unfortunately, very little evidence exists about the ability of LLMs to substantiate claims. In a new preprint study, we develop an approach to verify how well LLMs are able to cite medical references and whether these references actually support the claims generated by the models.
Is there a corresponding control group for how well an average doctor is able to cite medical references and whether these references actually support the claims generated by the doctors?
Give me a few minutes and I can pull up any number of medical studies or references to back up my claims.
I don’t have them memorized to the actual URL but I have kept up to date with the latest studies and summaries that pertain to my field and my patients.
A fair comparison (on a technical level) to GPT-4 RAG would be a doctor in a relevant field who also has internet access. I think this would be indeed interesting to compare to assess the resulting quality of care, so to speak!
(The other models being only partially able to source good references is unsurprising/"unfair" on a technical level, but that's not relevant for assessing their safety.)
They often give clinical recommendations and prescriptions, usually after a quick skim of the medical history and a rushed five minute conversation with the patient. It would be nice to know how many of these typical in-office recommendations wind up being actually backed by the current state of the research, whether a citation is given in the visit or not.
Most doctors will be able to turn around, pick the corresponding textbook from their library and show you where they learned something. Or point you to an actual clinical case they had to handle. If they had to, somehow.
Doctors also go to school to be doctors for 12 extra years. It turns out that medicine is hard. Not really a good reason to turn to an LLM that will just confidently make things up.
Martin Makary's study and the previous IOM one are based on faulty statistics. The number is extrapolated from a small population to a much larger one.
I haven't paid it any attention because of this problem. GIGO.

https://www.sciencealert.com/no-500-people-don-t-die-in-the-...
“ The researchers caution that most of medical errors aren’t due to inherently bad doctors, and that reporting these errors shouldn’t be addressed by punishment or legal action. Rather, they say, most errors represent systemic problems, including poorly coordinated care, fragmented insurance networks, the absence or underuse of safety nets, and other protocols, in addition to unwarranted variation in physician practice patterns that lack accountability.”
An LLM is not going to address any of that.
You are misinformed if you're implying that a significant majority of medical system errors are due to misdiagnosis.
This has been my major concern with the currently available LLMs.
You can know what the input is, you can know the output, you may even be aware what it's been trained on, but none of the output is ever cited. Unless you are already familiar with the topic, you cannot confidently distinguish between fact and what sounds reasonable and is accepted as fact.
ChatGPT4 correctly diagnosed my neurological condition, an infection that many doctors had missed. While I was in the hospital I asked ChatGPT the same questions as the doctors and it was nearly identical to what they were telling me every time.
It also acted as a therapist and talked me down from several depressions while in the hospital, far better than any human therapist I’ve ever had. The fact that it’s an AI made me actually feel better than if the therapy was delivered by a real therapist, for some strange reason.
Meanwhile my doctors are incompetent at diagnosing most of my symptoms, so I'll take ChatGPT over nothing until the medical system can get its shit together.
The safest way to use LLMs right now is for simple entity extraction of signs, symptoms, and investigation summaries, and then translating those into the inputs of an understandable linear/logistic model, e.g. the Wells criteria, the Canadian CT Head Rule, or the Centor score. I feel that a comprehensive but explainable model that supports multiple diagnoses will be developed in the future, but no such model currently exists.
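As a deliberately simplified sketch of that pattern - the LLM does only the entity extraction, and a transparent published score does the reasoning - here is the Centor-score half; the shape of the extracted dict is an assumption about what the extraction step would return:

    # The LLM's only job is to pull booleans out of the clinical note; the decision
    # support itself is the classic Centor score for strep pharyngitis (the McIsaac
    # variant adds an age adjustment, omitted here).
    CENTOR_CRITERIA = [
        "fever_over_38",
        "tonsillar_exudate",
        "tender_anterior_cervical_nodes",
        "cough_absent",
    ]

    def centor_score(extracted):
        # `extracted` is assumed to come from an LLM entity-extraction step.
        return sum(1 for c in CENTOR_CRITERIA if extracted.get(c))

    extracted = {"fever_over_38": True, "tonsillar_exudate": False,
                 "tender_anterior_cervical_nodes": True, "cough_absent": True}
    print(f"Centor score: {centor_score(extracted)}/4")

Every point is auditable against the published criteria, so a bad extraction is far easier to catch than a hallucinated free-text recommendation.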
If anyone is interested - my startup which did exactly this was acquired 8 months ago. As many mentioned - the sauce is in the RAG implementation and the curation of the base documents. As far as I can tell, the software is now covering ~11m lives or so and going strong - the company already got the acquisition price back and some more. I was even asked to come support an initiative to move from RAG to long context + multi-agents.
I know it works very well. There is a lot of literature from the medical community where they don't consult any actual AI engineers, and also a lot from the tech community where no clinicians are to be seen. Take both with a massive grain of salt.
The main thing is to curate a good set of documents to start with. Garbage in (like Bing/Google results, as this study did) --> garbage out. Simple similarity search is not enough.
From the technical side, the largest mistake people make is abstracting the process away with LangChain and the like, instead of hyper-optimizing every step with trial and error.
I am actively trying to come to grips with this problem. What are people's thoughts about using an LLM as a tool for linking out to information?
Traditional search is not proving sufficient for connecting patients and providers with the absolute wealth of information on grants, best practices, etc. There is simply too much content in too many places. I dream of something like a "Cancer Bot 9000" that would connect to resources pulled via RAG: not necessarily answering the questions directly, but interpreting them and connecting the person with the most likely resources. Bonus points for additional languages or accessibility, which I constantly see as barriers.
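A toy sketch of the "route to resources, don't answer" half of that idea (the resource list and scoring are invented; a real version would sit on top of the RAG index and handle multiple languages):

    # Rank curated resources by overlap with the question and return links, not
    # generated text; the model only interprets the question, it never answers it.
    RESOURCES = [
        {"title": "Financial assistance for cancer patients",
         "url": "https://example.org/financial-help",
         "tags": {"grant", "grants", "financial", "assistance", "cost"}},
        {"title": "Clinical trial finder",
         "url": "https://example.org/trial-finder",
         "tags": {"trial", "trials", "clinical", "enroll", "study"}},
    ]

    def route(question, k=3):
        terms = set(question.lower().split())
        scored = [(len(terms & r["tags"]), r) for r in RESOURCES]
        return [r for score, r in sorted(scored, key=lambda s: -s[0]) if score > 0][:k]

    for r in route("are there grants to help with the cost of treatment?"):
        print(r["title"], "-", r["url"])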
So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Importantly this doesn't actually guarantee that it does any kind of search.
I'm confused as to whether they're using the API or not. Afaik only the assistant API has access to the web search, so I would expect this was manually done? But then the reason for only doing this with openai is that the others don't provide an API
> GPT- 4 (RAG) refers to GPT-4’s web browsing capability pow- ered by Bing. Other RAG models such as Perplexity.AI or Bard are currently unavailable for evaluation due to a lack of API access with sources, as well as restrictions on the ability to download their web results. For example, while pplx-70b-online produces results with online access, it does not return the actual URLs used in those results. Gemini Pro is available as an API, but Bard’s implementa- tion of the model with RAG is unavailable via API.
What's more important is that a user _can see_ whether GPT-4 has searched for something or not, and can ask it to actually search the web for references.
They focus on the more specific problem of preventing drug misuse (checking interactions w/ other drugs and diseases, pathologies, etc). They use GPT-4 + RAG with qdrant and return the exact source of the information highlighted in the data. They are expanding their test set - they use real questions asked by GPs - but currently they have 0 % error rate (and less than 20 % cases where the model cannot answer).
> GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing.

They're Retrieving data from Bing to Augment GPT's Generations.
> Distilled down, classic search is many results; gpt is one result.

It's actively adversarial.
> And of course its responses are highly subject to the prompts used

It's surprisingly hard to ask open ended questions.
You might want to profile the genhealth.ai site and see why a Pixel 8 has trouble smoothly scrolling down the page.
> Is there a corresponding control group for how well an average doctor is able to cite medical references and whether these references actually support the claims generated by the doctors?

The difference is LLMs can't back their claims.