Interesting quote from the VentureBeat article linked:
> “There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.
In order for an LLM to really do this task the right way (comparable to a physician), they need to not only use what the human gives them but be effective at extracting the right information from the human; the human might not know what is important, or they might be disinclined to share, and physicians learn to overcome this. However, that isn't actually what happened in this study: the participants were diagnosing a made-up scenario where the symptoms were clearly presented to them, and they had no incentive to lie or withhold embarrassing symptoms, since nothing was actually happening to them; it was all made up. And yet it still seemed to happen that the participants did not effectively communicate all the necessary information.
> In order for an LLM to really do this task the right way (comparable to a physician), they need to not only use what the human gives them but be effective at extracting the right information from the human
That's true for most use cases, especially for coding.
Sure. But as a patient, you are also not expected to know what is or isn't important. Omitting information that seems unimportant to you, because your brain applies a low-pass filter, is partially what the doctor is trying to bypass.
Okay, so your code has been segfaulting at line 123 in complicated_func.cpp, and you want to know which version of libc you have to roll back to, as well as any related packages.
What's the current processor temperature, the EPS12V voltage, and the ripple peaks if you have an oscilloscope? Could you paste cpuinfo? Have you added or removed RAM or a PCIe device recently? Does the chassis smell and look normal: no billowing smoke, screeching noise, or fire?
Good LLMs might start asking these questions soon, but you wouldn't supply this information at the beginning of the interaction (and it's always the PSU).
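As a rough illustration of how that kind of up-front questioning could be wired in, here's a minimal sketch, assuming the OpenAI Python client; the system prompt wording and the model name are placeholders, not a tested recipe:

```python
# Minimal sketch: nudge the model to gather context before committing to a
# diagnosis. Assumes the OpenAI Python client (pip install openai); the
# prompt wording and model name are illustrative only.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a troubleshooting assistant. Before proposing any root cause, "
    "ask for the measurements you would want if you were on site: temperatures, "
    "supply voltages, recent hardware changes, and anything that looks, sounds, "
    "or smells wrong. Only diagnose once the user has answered or has said "
    "they cannot measure."
)

def triage_turn(history: list[dict]) -> str:
    """Run one conversational turn; history is a list of {'role', 'content'} dicts."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "system", "content": SYSTEM_PROMPT}, *history],
    )
    return response.choices[0].message.content

print(triage_turn([{"role": "user",
                    "content": "My code segfaults at line 123 in complicated_func.cpp."}]))
```

As comments further down note, this kind of "ask before you answer" instruction tends to work only intermittently in practice.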
Yeah, there's a lot of agency on both sides of the equation when it comes to any kind of consultant. You're less likely to have bad experiences with doctors if you're self-aware and thoughtful about how you interact with them.
This is just a random anecdote but ChatGPT (when given many, many details with 100% honesty) has essentially matched exactly what doctors told me in every case where I've tested it. This was across several non-serious situations (what's this rash) and one quite serious situation, although the last is a decently common condition.
The two times that ChatGPT got a situation even somewhat wrong, were:
- My kid had a rash and ChatGPT thought it was one thing. His symptoms changed slightly the next day, I typed in the new symptoms, and it got it immediately. We had to go to urgent care to get confirmation, but in hindsight ChatGPT had already solved it.
- In another situation my kid had a rash with somewhat random symptoms and the AI essentially said "I don't know what this is but it's not a big deal as far as the data shows." It disappeared the next day.
It has never gotten anything wrong other than these rashes. Including issues related to ENT, ophthalmology, head trauma, skincare, and more. Afaict it is basically really good at matching symptoms to known conditions and then describing standard of care (and variations).
I now use it as my frontline triage tool for assessing risk. Specifically, if ChatGPT says "see a doctor soon/ASAP" I do it; if it doesn't say to go see a doctor, I use my own judgment, i.e. I won't skip a doctor trip if I'm nervous just because the AI said so. This is all 100% anecdotes and I'm not disagreeing with the study, but I've been incredibly impressed by its ability to rapidly distill medical standard of care.
I've had an identical experience of ChatGPT misidentifying my kid's rash. In my case I would say it got points for being in the same ballpark: it guessed HFM, and the real answer was "an unnamed similar-ish virus to HFM but not HFM proper". The treatment was the same, just let it run its course, and our kid was fine. But I think it also made me realize that our pediatrician is still quite important in the sense that she has local, contextual, geography-based knowledge of what other kids in the area are experiencing too. She recognized it immediately because she had already seen two dozen other kids with it in the last month. That's going to be hard for any AI system to replicate until some distant time when all healthcare data is fed into The Matrix.
I wonder if the software developer mindset plays into this. We're really good at over-reporting all possibly-relevant information for "debugging" purposes.
I think it's fine as long as you don't blindly believe it. I.e. if it tells you it's X then you go and look at credible sources to confirm (sounds like he did that). If it tells you "it's nothing don't worry" but you have some reason to disagree then you don't blindly obey the AI.
When diagnosing kids' illnesses you're basically half guessing most of the time anyway. For example, the NHS tells you to call 111 (the non-emergency medical number) if they have a fever and they "do not want to eat, or are not their usual self and you're worried".
I think in America access to healthcare is pretty bad and expensive so probably a bit of AI help is a good thing. Vague wooly searches using written descriptions is one of the things they're actually quite good at.
This difference between medical board examinations and real-world practice is something that mirrors my real-world experience too, having finished med school and started residency a year ago.
I’ve heard others say before that real clinical education starts after medical school and once residency starts.
That 80% of medical issues could be categorized as "standard medicine" with some personalization to the person?
In residency you obviously see a lot of real-life complicated cases, but aren't the majority of cases something a non-resident could guide, if not diagnose?
Really what it seems to say is that LLMs are pretty good at identifying underlying causes and recommending medical actions, but if you let humans use LLMs to self-diagnose, the whole thing falls apart, if I read this correctly.
Yeah it sounds like "LLMs are bad at interacting with lay humans compared to being prompted by experts or being given well-formed questions like from licensing exams."
Feels to me like how two years ago "prompt engineering" got a bunch of hype in tech companies, and now is nonexistent because the models began being trained and prompted specifically to mimic "reasoning" for the sorts of questions tech company users had. Seems like that has not translated to reasoning their way through the sort of health conversations a non-medical-professional would initiate.
And there seem to be concrete results that would allow you to improve the LLM prompt to make these interactions more successful. Apparently giving the human 2-3 possible options and letting the human have the final choice was a big contributor to the bad results. Their recommendations go the route of "the model should explain it better" but maybe the best results would be achieved if the model was prompted to narrow it down until there is only one likely diagnosis left. This is more or less how doctors operate after all.
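A minimal sketch of what that "narrow it down" loop could look like, with the two LLM-backed helpers (pick_question, filter_candidates) left as hypothetical stand-ins, since only the control flow is the point:

```python
# Sketch of the "narrow it down" interaction pattern: instead of presenting
# 2-3 options and leaving the final choice to the user, keep asking one
# discriminating question at a time until a single likely diagnosis remains.
# pick_question and filter_candidates are hypothetical LLM-backed callables.
from typing import Callable

def narrow_down(
    candidates: list[str],
    pick_question: Callable[[list[str]], str],          # LLM: best discriminating question
    ask_user: Callable[[str], str],                      # the human answers it
    filter_candidates: Callable[[list[str], str, str], list[str]],  # LLM: prune the list
) -> str:
    remaining = list(candidates)
    while len(remaining) > 1:
        question = pick_question(remaining)
        answer = ask_user(question)
        remaining = filter_candidates(remaining, question, answer)
        if not remaining:
            raise ValueError("Answers ruled out every candidate; widen the differential.")
    return remaining[0]
```

Whether lay users would tolerate that many questions is a separate problem, but the loop matches the "one likely diagnosis left" behavior described above.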
At work, one of the prompt nudges that didn't work was asking it to ask for clarifications or missing info rather than charge forward with a guess. "Sometimes do X" instructions don't do well generally when the trigger conditions are fuzzy. (Or complex but stated in few words like "ask for missing info.") I can believe part of the miss here would be not asking the right questions--that seems to come up in some of their sample transcripts.
In general at work nudging them towards finding the information they need--first search for the library to be called, etc.--has been spotty. I think tool makers are putting effort into this from their end: newer versions of IDEs seemed to do better than older ones and model makers have added things like mid-reasoning tool use that could help. The raw Internet is not full of folks transparently walking through info-gathering or introspecting about what they know or don't, so it probably falls on post-training to explicitly focus on these kinds of capabilities.
I don't know what you really do. You can lean on instruction-following and give a lot of examples and descriptions of specific times to ask specific kinds of questions. You could use prompt distillation to try to turn that into better model tendencies. You could train on lots of transcripts (these days they'd probably include synthetic). You could do some kind of RL for skill at navigating situations where more info may be needed. You could treat "what info is needed and what behavior gets it?" as a type of problem to train on like math problems.
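For the last idea, here's a toy sketch of how "did it ask for the missing info before answering?" could be turned into something scoreable, e.g. as an RL reward or a filter on synthetic transcripts; the Transcript shape and the scoring rules are invented for illustration:

```python
# Toy reward: credit the model for requesting required facts it wasn't given,
# and penalize it for answering before the picture was complete. All fields
# and weights are made up for illustration.
from dataclasses import dataclass

@dataclass
class Transcript:
    required_facts: set[str]          # facts the task cannot be solved without
    facts_provided_upfront: set[str]  # what the user volunteered
    facts_asked_for: set[str]         # what the model explicitly requested
    answered_before_complete: bool    # did it commit to an answer too early?

def info_gathering_reward(t: Transcript) -> float:
    missing = t.required_facts - t.facts_provided_upfront
    if not missing:
        # Nothing was missing: reward not pestering the user with questions.
        return 1.0 if not t.facts_asked_for else 0.5
    asked_fraction = len(missing & t.facts_asked_for) / len(missing)
    penalty = 0.5 if t.answered_before_complete else 0.0
    return max(0.0, asked_fraction - penalty)
```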
I've seen that LLMs hallucinate in very subtle ways when guiding you through a course of treatment.
Once, when I had to administer eyedrops to a parent and saw redness and was being conservative, it told me to stop the wrong drop. The doctor saw my parent the next day so it was all fixed, but it did lead to me freaking out.
Doctors behave very differently from how we normal humans behave. They go through testing that not many of us would be able to sit through let alone pass.
And they are taught a multitude of subjects that are so far away from the subjects everyone else learns that we have no way to truly communicate to them.
And this massive chasm is the problem, not that the LLM is the wrong tool.
Thinking probabilistically (mainly Bayesian reasoning) and understanding the first two years of med school will help you use an LLM much more effectively for your health.
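To make the Bayesian point concrete, here's a small worked example (the numbers are illustrative, not medical data) showing why the base rate matters as much as how well the symptoms seem to match:

```python
# Bayes' rule: P(condition | positive finding) depends on the prior
# (base rate) as much as on how well the finding "matches".
# All numbers below are illustrative, not medical data.
def posterior(prior: float, sensitivity: float, specificity: float) -> float:
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

# A finding that is 90% sensitive and 95% specific for a condition with a
# 1-in-1,000 base rate still leaves the posterior below 2%.
print(posterior(prior=0.001, sensitivity=0.90, specificity=0.95))  # ~0.018
```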
https://github.com/am-bean/HELPMed (also linked in the paper)