simonw · 2 years ago
"For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported and nearly half of its responses are not fully supported"

Show us the source code and data. The way the RAG system is implemented is responsible for that score.

Building a RAG system that provides good citations on top of GPT-4 is difficult (and I would say not a fully solved problem at this point) but those implementation details still really matter for this kind of study.
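
To make that concrete, here's roughly the shape of such a pipeline, as a minimal sketch; the prompt wording, the retrieve() helper, and the chunk format are all illustrative, not what the paper or ChatGPT's browsing mode actually does:

    # Illustrative sketch of a citation-aware RAG step on top of GPT-4.
    # retrieve() is a hypothetical function returning (source_id, url, text) chunks.
    from openai import OpenAI

    client = OpenAI()

    def answer_with_citations(question, retrieve):
        chunks = retrieve(question, k=5)
        context = "\n\n".join(f"[{sid}] {url}\n{text}" for sid, url, text in chunks)
        prompt = (
            "Answer using ONLY the sources below. After every statement, cite the "
            "supporting source id in brackets, e.g. [S2]. If no source supports a "
            "statement, leave it out.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content, chunks

Whether the cited source actually supports each statement then depends almost entirely on what retrieve() returned and how the prompt was phrased, which is why the implementation details matter so much here.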

UPDATE: I found it in the paper: https://arxiv.org/html/2402.02008v1#S3 - "GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing."

So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.

IanCal · 2 years ago
It's using the web search provided by OpenAI.

Importantly, this doesn't actually guarantee that it does any kind of search.

I'm confused as to whether they're using the API or not. AFAIK only the Assistants API has access to web search, so I would expect this was done manually? But then the stated reason for only doing this with OpenAI is that the others don't provide an API:

> GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing. Other RAG models such as Perplexity.AI or Bard are currently unavailable for evaluation due to a lack of API access with sources, as well as restrictions on the ability to download their web results. For example, while pplx-70b-online produces results with online access, it does not return the actual URLs used in those results. Gemini Pro is available as an API, but Bard’s implementation of the model with RAG is unavailable via API.

golergka · 2 years ago
> Importantly this doesn't actually guarantee that it does any kind of search.

What's more important is that a user _can see_ whether GPT-4 has searched for something or not, and can ask it to actually search the web for references.

CSMastermind · 2 years ago
That's wildly misleading then. It would be interesting to see how GPT-4, properly augmented with actual medical literature, would do.
catwell · 2 years ago
I saw a presentation about this last week at the Generative AI Paris meetup, by the team building the next generation of https://vidal.fr/, the reference for medical data in French-speaking countries. It used to be a paper dictionary and has existed since 1914.

They focus on the more specific problem of preventing drug misuse (checking interactions with other drugs, diseases, pathologies, etc.). They use GPT-4 + RAG with qdrant and return the exact source of the information, highlighted in the data. They are expanding their test set - they use real questions asked by GPs - but currently they have a 0% error rate (and fewer than 20% of cases where the model cannot answer).
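
For anyone curious what that looks like mechanically, a rough sketch (the collection name, payload fields, and prompt are my guesses for illustration, not their actual implementation):

    # Sketch: GPT-4 + qdrant, returning the exact source passage with the answer.
    from openai import OpenAI
    from qdrant_client import QdrantClient

    oai = OpenAI()
    qdrant = QdrantClient(url="http://localhost:6333")

    def check_drug_question(question):
        vec = oai.embeddings.create(model="text-embedding-3-small",
                                    input=question).data[0].embedding
        hits = qdrant.search(collection_name="drug_monographs",
                             query_vector=vec, limit=3)
        # Payloads are assumed to carry the monograph text plus its reference.
        sources = [(h.payload["reference"], h.payload["text"]) for h in hits]
        context = "\n\n".join(f"Source: {ref}\n{txt}" for ref, txt in sources)
        prompt = ("Using only these monograph excerpts, answer the question and "
                  "quote the exact supporting passage, or say you cannot answer.\n\n"
                  f"{context}\n\nQuestion: {question}")
        answer = oai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        return answer, sources  # the caller can highlight the returned passages

The "cannot answer" escape hatch is presumably where their under-20% abstention rate comes from.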

AustinDev · 2 years ago
Likely better than the average doctor. If I had the opportunity to take that bet, I would.
3abiton · 2 years ago
So odd that they call GPT-4 + Bing a "RAG" system.
dartos · 2 years ago
Is it not?

They’re Retrieving data from Bing to Augment GPT’s Generations.

golol · 2 years ago
> So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.

Man, I am so disappointed. This is not a good study. Come on.

Aurornis · 2 years ago
A close friend of mine went down the ChatGPT medical rabbit hole last year. He was disagreeing with his doctors and getting deep into supplements and unproven medical theories.

ChatGPT was instrumental in convincing him that he was correct and his doctors were wrong. He would post his ChatGPT conversations as proof, but we could see that his prompts were becoming obvious leading questions.

He would phrase things like “Is it possible for {symptoms} to be caused by {condition} and could it be treated by {herbal product}?” Then ChatGPT would give him a wall of text saying that it’s possible, which he took as confirmation of being correct.

It was scary to see how much ChatGPT turned into a mirror for what he wanted to be told. He got very good at getting the answers he wanted. He could dismiss answers that disagreed as being hallucinations or being artifacts of an overly protective system. With enough repeat questioning and adjusting his prompts he could get it to say almost whatever he wanted to hear.

ChatGPT is rocket fuel for hypochondriacs. Makes WebMD look tame.

aurareturn · 2 years ago
FYI, this is the same as people doing web searches. You’ll always find a place that agrees with you or says it’s possible.
quest88 · 2 years ago
Not quite. People outside of tech can hear "AI" and think it must be right. They've heard how great GPT is, so it must be right. There are no other opinions.

Distilled down, classic search is many results; GPT is one result.

xyzzy123 · 2 years ago
It's even worse than that, imho: unscrupulous sellers will go into Keyword Planner in Google Ads, find "under-served" niches, and create content there to sell supplements.

It's actively adversarial.

BlueTemplar · 2 years ago
With web searches, the reputation of sources can at least be assessed. (Which seems to help even GPT-4?)
kromem · 2 years ago
One of the most interesting things for me over the past 18 months was seeing the difference between this behavior in GPT-3.5 and 4 (especially the early versions of it).

It jumped from being extremely sensitive to leading questions to being almost ornery in its stubbornness and disagreement. That was one of the big early indicators to me that significant differences were occurring between the two.

I really do wonder just how harmful the disconnect has been between people hearing accurate praise about GPT-4 powered agents and then interacting with GPT-3 level agents and assuming the same competencies.

GPT-3 was a very convincing text generator with lackluster reasoning capabilities, but 4 has significant reasoning capabilities and is less prone to parroting. But the majority of users are likely interacting with 3 because it's free.

It's been very hard to explain to lay audiences the vast difference between two similarly branded products.

vidarh · 2 years ago
While I agree the difference is vast, GPT-4 still goes down the garden path a lot. If you're critical you can usually spot its reasoning errors and challenge them, but it still has enough issues that, conversely, I think in some areas there's a risk it'll be even better at convincing people of the wrong things.

mateo1 · 2 years ago
WebMD lists facts. ChatGPT will output things that sound plausible. It's like comparing a medical textbook with the town lunatic. Not the same.
taneq · 2 years ago
Playing with the smaller free models, I’ve definitely found that they like to play along and mirror whatever you say. They’re great as a sounding board and active listener when you need to work through something, and they’re good for recalling facts that they know. They can be really useful when they glue together something you’re talking about and something relevant that they know. They’re terrible for anything that requires interoception or any kind of confidence estimation. They have no idea whether they’re recalling or confabulating.
rickysahu · 2 years ago
It’s too bad that this type of data beyond the medical record is not tracked alongside it. Although many anecdotal associations are probably not correct for the broad population, there are likely nuances in a patient's history and lifestyle which do make some more basic interventions helpful. ChatGPT surely picks up on some trends but doesn’t have enough data about the people for whom those trends are effective.

And of course its responses are highly subject to the prompts used.

kenjackson · 2 years ago
I tried using your template for a few items and none gave misleading results. Here’s one example:

Prompt:

Can loss of vision be caused by a meniscus tear and treated with rose hips?

Response:

A meniscus tear is a knee injury affecting the cartilage that acts as a cushion between the thigh bone (femur) and shin bone (tibia). This type of injury typically affects mobility and stability of the knee, causing pain, swelling, and difficulty in moving the knee joint. There is no direct physiological pathway through which a meniscus tear could cause loss of vision. Vision problems are generally related to issues within the eye or neurological conditions affecting the optic nerves or brain, not injuries to the knee or other joints.

Regarding the treatment with rose hips for a meniscus tear, while rose hips are known for their high vitamin C content and anti-inflammatory properties, they are not a recognized treatment for repairing a meniscus tear. They may potentially contribute to overall joint health and reduce inflammation, which could be beneficial in managing symptoms associated with joint injuries or conditions like arthritis. However, the primary treatments for a meniscus tear usually involve rest, ice, compression, elevation (RICE protocol), physical therapy, and in some cases, surgery, depending on the severity of the tear.

If you are considering using rose hips or any other supplements for joint health or recovery from an injury, it's important to discuss this with your healthcare provider to ensure it's safe and appropriate for your specific situation. They can offer guidance on effective treatments for your condition and whether any supplements could support your recovery.

callalex · 2 years ago
This seems like a false comparison because you are bringing up entirely different parts of the body. Most dangerous misunderstandings start from a grain of plausible truth. For example, many people will choose a category of food/ingredient that causes every ailment known to man, and then cut out that ingredient from their diet. They will subsequently claim that skipping that ingredient changes their life, when in all likelihood the simple act of paying attention to where their food comes from and what went into it changes their consumption and behaviors in significant ways that have nothing to do with the bogeyman ingredient. They will then find a study that pumped 100g of said ingredient into a 200g mouse and point out all the negative effects that had which they are now avoiding.
sigmoid10 · 2 years ago
Is this GPT-4? There's a good chance the other person used the free 3.5, which is significantly worse in these respects.
refulgentis · 2 years ago
I think it's because those are completely unrelated
ageek123 · 2 years ago
You can do the same thing with regular web search.
npalli · 2 years ago
No, a regular web search will just give you links; you have to click through and read the text yourself. Putting together a comprehensive page of (seemingly) cogent arguments takes GenAI.
dartos · 2 years ago
Not as quickly, easily, or convincingly.
neaden · 2 years ago
I think the big difference is that with a web search there will be a couple of reputable sources at or near the top, like WebMD and the Mayo Clinic. I can search, click one of those, and be fairly sure it'll be accurate. With ChatGPT there is no immediate way for me to know whether it's in reliable mode or crank mode.
JPLeRouzic · 2 years ago
> He would phrase things like “Is it possible for {symptoms} to be caused by {condition} and could it be treated by {herbal product}?” Then ChatGPT would give him a wall of text saying that it’s possible, which he took as confirmation of being correct.

If you mine PubMed, you sometimes find literature with opposite claims. So if an LLM is trained on PubMed/PMC, it will repeat that kind of garbage.

You don't have to dig into "herbal products"; it also happens in more conventional domains.

I once even found that kind of contradiction in articles where the "main author" was the same in both publications. It was about ALS, and the "main author" was a prominent scientist from the USA who probably never wrote, nor even read, any of these articles.

photochemsyn · 2 years ago
One of the most basic prompts everyone should be using with questions like this is something like "provide detailed arguments both for and against the presented hypothesis."
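
For example, something like this (a trivial sketch with the OpenAI Python client; the exact wording is just one way to phrase it):

    # Ask the model to argue both sides before concluding.
    from openai import OpenAI

    client = OpenAI()

    def balanced_answer(hypothesis):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content":
                       "Provide detailed arguments both for and against the "
                       "presented hypothesis, then say which side the evidence "
                       f"favors and why.\n\nHypothesis: {hypothesis}"}],
        )
        return resp.choices[0].message.content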
kromem · 2 years ago
This works when the person evaluating can tell whether the subject matter is correct or not.

But the models can generate compelling devil's advocate positions, particularly when they confabulate supporting facts, which might appear convincing to non-experts.

seydor · 2 years ago
ChatGPT is a continuator; of course it will mirror input. But I'm sure someone is training the adversarial persuasion machine that will fund the next internet with ads.
hackerlight · 2 years ago
It's the examples curated for RLHF, not its autoregressive nature.
koliber · 2 years ago
The same thing happens with web searching. If you provide a leading question in the query you are more likely to get results that confirm the thesis.

It’s surprisingly hard to ask open-ended questions.

exitb · 2 years ago
You could make a religion out of this.
staunton · 2 years ago
People definitely will. There's also already a political party "led" by an AI.
CatWChainsaw · 2 years ago
No don't

rickysahu · 2 years ago
Agreed that this is a challenging problem, but mostly because of the data used for training and the tokenization used by language models. We're working on this, building what we call a large medical model (lmm), which is trained on medical event histories from tens of millions of patients, not papers or internet text. Our tokens are medical codes, facilities, demographics, etc. If anyone is interested, we have an API + we're hiring. https://genhealth.ai
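
To make the idea concrete, a toy illustration of the general shape (the codes and vocabulary here are simplified placeholders, not our actual token set):

    # Toy sketch: a patient's event history as a token sequence for a sequence model.
    PATIENT_HISTORY = [
        {"type": "demo",  "value": "AGE_60_69"},
        {"type": "demo",  "value": "SEX_F"},
        {"type": "dx",    "value": "ICD10_E11.9"},      # type 2 diabetes
        {"type": "rx",    "value": "NDC_METFORMIN"},
        {"type": "visit", "value": "FACILITY_OUTPATIENT"},
        {"type": "dx",    "value": "ICD10_I10"},        # hypertension
    ]

    VOCAB = {tok: i for i, tok in enumerate(
        sorted({f'{e["type"]}:{e["value"]}' for e in PATIENT_HISTORY}))}

    def encode(history):
        """Map each medical event to an integer token id."""
        return [VOCAB[f'{e["type"]}:{e["value"]}'] for e in history]

    print(encode(PATIENT_HISTORY))  # -> [0, 1, 2, 4, 5, 3]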
noduerme · 2 years ago
This is fascinating... but just a superficial nit about the website and what it puts across: as someone who spent half my life designing, illustrating, and art directing for brands, the choice to use anime-style art as the theme for section headers is exactly the type of decision I look for when reviewing portfolios. It tells me a potential hire might know how to use image editing tools or be proficient in technical fields, but is not actually a designer; i.e. they can create an image they "like" but lack the knowledge and creative ability to synthesize subtexts from design and art history in original ways that are visually compelling while also setting the right tone for a client. The choice of generated anime art implies its use on the page as mere decoration, rather than as a language of communication carefully selected to give viewers the right impression of the company. It argues for why diffusion models guided by engineers will not replace professional art direction anytime soon.

No offense, this is a standard art crit I would give to any art student, or to anyone running a startup who had nonprofessional design.

rickysahu · 2 years ago
Ha! We’re just trying to use the art to stand out from the rest of health tech, which is quite boring. I get that the target audience is not often into anime (I've watched maybe 10 anime anythings in my life); nevertheless it looks cooler than the "lobby people" on other websites, and I like to think, because of threads like this, that the selection of art has served its purpose.
user3939382 · 2 years ago
To play devil's advocate, it could turn out that only one person needed to be persuaded by the art: some investor who ends up leading their next round and happens to be an anime fan, in which case it was the right choice.

> rather than as a language of communication carefully selected to give viewers the right impression of the company

You could argue it gives the impression that the culture of the company is primarily technical, given that the technical and anime communities have a huge overlap.

mkoryak · 2 years ago
Your site must be running an lmm because scrolling is laggy.

You might want to profile it and see why a Pixel 8 has trouble smoothly scrolling down the page.

srameshc · 2 years ago
Very interesting. Who are your target customers?
rickysahu · 2 years ago
It's broadly applicable, but initially health plans and provider orgs.
larsiusprime · 2 years ago
> Unfortunately, very little evidence exists about the ability of LLMs to substantiate claims. In a new preprint study, we develop an approach to verify how well LLMs are able to cite medical references and whether these references actually support the claims generated by the models.

Is there a corresponding control group for how well an average doctor is able to cite medical references and whether these references actually support the claims generated by the doctors?

jncfhnb · 2 years ago
Doctors don’t cite claims on the spot
esoleyman · 2 years ago
Give me a few minutes and I can pull up any number of medical studies or references to back up my claims.

I don’t have them memorized to the actual URL but I have kept up to date with the latest studies and summaries that pertain to my field and my patients.

MauranKilom · 2 years ago
A fair comparison (on a technical level) to GPT-4 RAG would be a doctor in a relevant field who also has internet access. I think this would be indeed interesting to compare to assess the resulting quality of care, so to speak!

(The other models being only partially able to source good references is unsurprising/"unfair" on a technical level, but that's not relevant for assessing their safety.)

larsiusprime · 2 years ago
They often give clinical recommendations and prescriptions, usually after a quick skim of the medical history and a rushed five minute conversation with the patient. It would be nice to know how many of these typical in-office recommendations wind up being actually backed by the current state of the research, whether a citation is given in the visit or not.
numpad0 · 2 years ago
Some engineers can cite RFCs and ISO standards on the spot; it's probably the same for doctors.
SkyPuncher · 2 years ago
Neither do most professionals, unless they're doing a prepared segment.

The difference is LLMs can’t back their claims.

mateo1 · 2 years ago
Most doctors will be able to turn around, pick the corresponding textbook from their library, and show you where they learned something, or point you to an actual clinical case they handled, if they somehow had to.
paulddraper · 2 years ago
No, but studies have found that 250k+ deaths per year in the US are due to medical errors. [1]

[1] https://www.hopkinsmedicine.org/news/media/releases/study_su...

__loam · 2 years ago
Doctors also go to school to be doctors for 12 extra years. It turns out that medicine is hard. Not really a good reason to turn to an LLM that will just confidently make things up.
esoleyman · 2 years ago
Martin Makary’s study and the previous IOM one are based on faulty statistics. The number is extrapolated from a small population to a much larger one.

I haven’t paid it any attention because of this problem. GIGO.

https://www.sciencealert.com/no-500-people-don-t-die-in-the-...

epcoa · 2 years ago
Yes perhaps read your own link:

“ The researchers caution that most of medical errors aren’t due to inherently bad doctors, and that reporting these errors shouldn’t be addressed by punishment or legal action. Rather, they say, most errors represent systemic problems, including poorly coordinated care, fragmented insurance networks, the absence or underuse of safety nets, and other protocols, in addition to unwarranted variation in physician practice patterns that lack accountability.”

An LLM is not going to address any of that. You are misinformed if you're implying that a significant majority of medical system errors are due to misdiagnosis.

extragood · 2 years ago
This has been my major concern with the currently available LLMs.

You can know what the input is, you can see the output, you may even be aware of what it's been trained on, but none of the output is ever cited. Unless you are already familiar with the topic, you cannot confidently distinguish fact from something that merely sounds reasonable and gets accepted as fact.

melagonster · 2 years ago
I'm sure most treatments are from textbooks.
iamleppert · 2 years ago
ChatGPT4 correctly diagnosed my neurological condition, an infection that many doctors had missed. While I was in the hospital I asked ChatGPT the same questions as the doctors and it was nearly identical to what they were telling me every time.

It also acted as a therapist and talked me down from several depressions while in the hospital, far better than any human therapist I’ve ever had. The fact that it’s an AI made me actually feel better than if the therapy was delivered by a real therapist, for some strange reason.

dheera · 2 years ago
Meanwhile my doctors are incompetent at diagnosing most of my symptoms, so I'll take ChatGPT over nothing until the medical system can get its shit together.
rubatuga · 2 years ago
The safest way to use LLMs right now is for simple entity extraction of signs, symptoms, and investigation summaries, then translating those into the inputs of an understandable linear/logistic model, e.g. the Wells criteria, Canadian CT Head Rule, or Centor score. I feel that a comprehensive but explainable model that supports multiple diagnoses will be developed in the future, but no such model currently exists.
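
Concretely, something like this; a sketch where the LLM only fills in binary findings (the extraction step is elided) and the scoring itself is the standard Wells criteria for pulmonary embolism:

    # The LLM extracts findings from the note; the risk logic stays rule-based.
    WELLS_PE_WEIGHTS = {
        "clinical_signs_of_dvt": 3.0,
        "pe_most_likely_diagnosis": 3.0,
        "heart_rate_over_100": 1.5,
        "immobilization_or_recent_surgery": 1.5,
        "previous_dvt_or_pe": 1.5,
        "hemoptysis": 1.0,
        "malignancy": 1.0,
    }

    def wells_pe_score(findings):
        """findings: dict of criterion -> bool, e.g. produced by an LLM
        extraction pass over the clinical note."""
        score = sum(w for k, w in WELLS_PE_WEIGHTS.items() if findings.get(k))
        risk = "low" if score < 2 else "moderate" if score <= 6 else "high"
        return score, risk

    # Tachycardic patient with a prior DVT and hemoptysis:
    print(wells_pe_score({"heart_rate_over_100": True,
                          "previous_dvt_or_pe": True,
                          "hemoptysis": True}))  # -> (4.0, 'moderate')

Every input is auditable, and the mapping from findings to risk is the published criteria rather than whatever the model feels like generating.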

jonathan-adly · 2 years ago
If anyone is interested: my startup, which did exactly this, was acquired 8 months ago. As many have mentioned, the sauce is in the RAG implementation and the curation of the base documents. As far as I can tell, the software now covers ~11m lives or so and is going strong; the company has already made the acquisition price back and then some. I was even asked to come support an initiative to move from RAG to long context + multi-agents.

I know it works very well. There is a lot of literature from the medical community where they don't consult any actual AI engineers, and a lot from the tech community where no clinicians are to be seen. Take both with a massive grain of salt.

dartos · 2 years ago
If you don’t mind sharing, what lesser known RAG tricks did you use to ensure the correct information was going through?

Simple similarity search is not enough.

jonathan-adly · 2 years ago
The main thing is to curate a good set of documents to start with. Garbage in (like the Bing/Google web results this study relied on) --> garbage out.

From the technical side, the largest mistake people make is abstracting the process with LangChain and the like, instead of hyper-optimizing every step with trial and error.
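
Concretely, that just means keeping each stage a plain function you can evaluate and swap on its own; a rough skeleton (the step names are generic, not our actual pipeline):

    # Explicit stages instead of a framework chain, so each one can be tuned.
    def answer(question, embed, search, rerank, build_prompt, generate):
        query_vec = embed(question)                  # try different embedding models
        candidates = search(query_vec, k=50)         # tune k, filters, hybrid search
        top_docs = rerank(question, candidates)[:5]  # swap rerankers, tune the cutoff
        prompt = build_prompt(question, top_docs)    # iterate on the wording
        return generate(prompt), top_docs            # keep sources for citation

Each of those steps is something you can measure on its own, which is what "hyper-optimizing every step" looks like in practice.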

robviren · 2 years ago
I am actively trying to come to grips with this problem. What are people's thoughts on using an LLM as a tool for linking out to information?

Traditional search is not proving sufficient for connecting patients and providers with the absolute wealth of information on grants, best practices, etc. There is simply too much content in too many places. I dream of something like "Cancer Bot 9000" that would connect to resources pulled via RAG, not necessarily answering questions directly but interpreting them and connecting the person with the most likely resources. Bonus points for additional languages or accessibility, which I constantly see as a barrier.