chrisfosterelli · 14 days ago
Health metrics are absolutely tarnished by a lack of proper context. Unsurprisingly, it turns out that you can't reliably take a concept as broad as health and reduce it to a number. We see the same arguments over and over with body fat percentages, VO2 max estimates, BMI, lactate thresholds, resting heart rate, HRV, and more. These are all useful metrics, but it's important to consider them in the proper context that each of them deserves.

This article gave an LLM a bunch of health metrics and then asked it to reduce them to a single score, didn't tell us any of the actual metric values, and then compared that to a doctor's opinion. Why anyone would expect these to align is beyond my understanding.

The most obvious thing that jumps out to me is that I've noticed doctors generally, for better or worse, consider "health" much differently than the fitness community does. It's different toolsets and different goals. If this person's VO2 max estimate was under 30, that's objectively a poor VO2 max by most standards, and an LLM trained on the internet's entire repository of fitness discussion is likely going to give this person a bad score in terms of cardio fitness. But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health regardless of what their wearable says.

I'd go so far as to say this is probably the case for most people. Your average person is in really poor fitness-shape but just fine health-shape.

inopinatus · 14 days ago
Many of those metrics are population or sampling measures and are confounded by many factors at an individual level. The most notorious of which is BMI; it is practically a category error to infer someone's health or risk by individual BMI, and yet doing so remains widespread amongst people that are supposed to know better.

Instrumentation and testing become primarily useful at an individual level to explain or investigate someone's disease or disorder, or to screen for major risk factors, and the hazards and consequences of unnecessary testing outweigh the benefits in all but a few cases. For those few cases, your GP and/or government will (or should) routinely screen people at actual risk, which is why I pooped in a jar last week and mailed it.

An athlete chasing an ever-better VO2max or FTP hasn't necessarily got it wrong, however. We can say something like, "Bjorn Daehlie’s results are explained by extraordinary VO2max", with an implication that you should go get results some other way because you're not a five-sigma outlier. But at the pointy end of elite sport, there's a clear correlation between marginal improvement of certain measures and competitive outcomes, and if you don't think the difference of 0.01sec between first and third matters then you've never stood on a podium. Or worse, next to one. When mistakes are made and performance deteriorates, it's often due to chasing the wrong metric(s) for the athlete at hand, generally a failure of coaching.

FeteCommuniste · 13 days ago
> The most notorious of which is BMI; it is practically a category error to infer someone's health or risk by individual BMI, and yet doing so remains widespread amongst people that are supposed to know better.

BMI works fine for people who aren't very muscular, which is the great majority of people. Waist to height ratio might be more informative for people with higher muscle mass.
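Both are trivial to compute; a minimal sketch in Python (the cut-offs in the comments are the commonly cited rules of thumb, not anything specific to this thread):

  def bmi(weight_kg, height_m):
      """Body mass index: weight in kilograms divided by height in metres squared."""
      return weight_kg / height_m ** 2

  def waist_to_height(waist_cm, height_cm):
      """Waist-to-height ratio; values above roughly 0.5 are the usual flag."""
      return waist_cm / height_cm

  # Example: 85 kg, 1.80 m tall, 94 cm waist
  print(round(bmi(85, 1.80), 1))             # 26.2 -> "overweight" by the usual BMI bands
  print(round(waist_to_height(94, 180), 2))  # 0.52 -> just over the 0.5 rule of thumb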

Shank · 14 days ago
> But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health regardless of what their wearable says.

This is true of many metrics and even lab results. Good doctors will counsel you and tell you that the lab results are just one metric and one input. The body acclimates to its current conditions over time, and quite often achieves homeostasis.

My grandma was living for years with an SpO2 in the 90-95% range as measured by pulse oximetry, but this was just one metric measured with one method. It doesn't mean her blood oxygen was actually repeatedly dropping, it just meant that her body wasn't particularly suited to pulse oximetry.

vidarh · 13 days ago
It doesn't help when doctors are often unaware of outliers affecting the test results. E.g. I've had a number of doctors freak out over my eGFR (kidney function) test results because the default test they use is affected by body mass and diet, and made even worse by e.g. preworkout supplements with creatine. None of my doctors have been aware of this, and I've had to explain it to them.
colechristensen · 14 days ago
>I'd go so far as to say this is probably the case for most people. Your average person is in really poor fitness-shape but just fine health-shape.

Modern medicine has failed to move into the era of subtlety and small problems and many people suffer as a result. Fitness nerds and general non-scientists fill the gap poorly so we get a ton of guessing and anecdotal evidence and likely a whole lot of bad advice.

Doctors won't say there's a problem until you're SICK and usually pretty late in the process when there's not a lot of room to make improvements.

At the same time, doctors won't do anything if you're 5% off optimal, but they'll happily give you a medicine that improves one symptom that's 50% off optimal and comes with 10 side effects. Although unless you're dying or have something really straightforward wrong with you, doctors don't do much at all besides giving you a sedative and/or a stimulant.

Doctors don't know what to do with small problems because they're barely studied and the people who DO try to do something don't do it scientifically.

anon7000 · 14 days ago
A worthwhile book to read on this topic is Outlive by Peter Attia (MD). The core premise is that American healthcare focuses far too much on treating problems after they’re extremely severe. It would be cheaper and healthier to invest more into conservative & preventative care, trying to prevent or minimize problems early in life before they become incredibly dangerous and expensive/difficult/impossible to treat.

I have a close friend who works in conservative care, and it’s astonishing what they see. For example, someone went to a number of specialists and doctors about a throat condition where they really struggled to swallow. They even had to swallow a radioactive pill for some kind of imaging. Unnecessary exposure, an expensive process to go through, and it ultimately went exactly nowhere.

Meanwhile, it was a simple musculoskeletal issue which my friend was able to resolve in a single visit with absolutely no risk to the patient.

Medical schools need to stop producing MDs who reach for pills as the first line of defense without trying to root-cause issues. Do you really need addictive painkillers, or maybe some PT, exercise, massage, etc. to help resolve your pain?

lnsru · 14 days ago
It’s not medicine, it’s the healthcare system. The doctor isn’t paid enough to go through the complaint thoroughly and dig deeper. In Germany you get a 5-minute diagnosis and that’s all health insurance pays for. And that’s from the better doctors. From a normal one, the diagnosis comes from a 2-minute interaction. Believing that the diagnosis is right is very naive.
Angostura · 14 days ago
> Doctors won't say there's a problem until you're SICK and usually pretty late in the process when there's not a lot of room to make improvements.

As someone who is fit and active, in their 60s with zero obvious symptoms, but is nonetheless on cholesterol and blood pressure medication, this isn't true (in the UK, at least)

PlatoIsADisease · 13 days ago
I think one of the major problems is that biologists/scientists cannot legally treat people. Physicians take their studies and have monopolistic treatment powers over them.

I think this creates a huge knowledge gap.

steveBK123 · 13 days ago
It’s also cultural. Most American doctors don’t bother to tell people if they are overweight and out of shape. It’s not something their customers reward.
Propelloni · 14 days ago
Maybe I'm not getting you right, but IMO it hasn't? I, as a customer/patient, just don't converse weekly with my MD about small issues, and frankly, they have better things to do, for example treating sick people.

Instead I use the health benefits programs of my health care insurer. My insurer has an interest in prevention, so I can get consulting for free (or very low fees), and even kickbacks if I regularly participate in fitness courses and maintain my yearly check-up routine. Now, I live in Germany and it probably is different in other countries, but it just makes economic sense from the insurer's point of view, so I would be surprised if it were very different elsewhere.

sksksk · 14 days ago
>This article gave an LLM a bunch of health metrics and then asked it to reduce them to a single score, didn't tell us any of the actual metric values, and then compared that to a doctor's opinion. Why anyone would expect these to align is beyond my understanding.

This gets to one of LLMs' core weaknesses: they blindly respond to your requests and rarely push back against the premise.

next_xibalba · 13 days ago
I read somewhere that LLM chat apps are optimized to return something useful, not correct or comprehensive (where useful is defined as the user accepts it). I found this explanation to be a useful (ha!) way to explain to friends and family why they need to be skeptical of LLM outputs.
theshrike79 · 14 days ago
Measuring metrics is easy, it's the algorithm on the backend that matters.

There's a reason why Oura rings are expensive and it's not the hardware - you can get similar stuff for 50€ on Aliexpress.

But none of them predicted my Covid infection days in advance. Oura did.

A device like the Apple Watch that's on you 24/7 is good with TRENDS, not absolute measurements. It can tell you if your heart rate, blood oxygen or something else is more or less than before, statistically. For absolute measurements it's OK, but not exact.

And from that we can make educated guesses on whether a visit to a doctor is necessary.
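To make the trends-vs-absolutes point concrete, here's a minimal sketch (not any vendor's actual algorithm) of flagging a deviation from your own rolling baseline rather than from a population cutoff:

  import statistics

  def deviates_from_baseline(daily_values, window=28, z_threshold=2.0):
      """Compare today's reading against the person's own recent baseline, not an absolute cutoff."""
      baseline, today = daily_values[-(window + 1):-1], daily_values[-1]
      mean = statistics.mean(baseline)
      sd = statistics.stdev(baseline) or 1.0
      z = (today - mean) / sd
      return z, abs(z) >= z_threshold

  # Example: ~4 weeks of resting heart rates around 58 bpm, then a jump to 70 bpm
  history = [58, 57, 59, 58, 60, 57, 58] * 4 + [70]
  z, flagged = deviates_from_baseline(history)
  print(f"z-score vs own baseline: {z:.1f}, worth a closer look: {flagged}")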

smallerfish · 13 days ago
> But none of them predicted my Covid infection days in advance. Oura did.

It actually warned you, or retrospectively looking at the metrics you could see that there was a pattern in advance of symptoms? (If the latter, same here with my Garmin watch - precipitous HRV decline in the 7 days before symptoms. But no actual warning.)

yolo3000 · 14 days ago
I'm curious how the ring detected it in advance? I also discovered my Covid when I looked at my Garmin watch and my resting heart rate was 100; until then I had thought I'd just had too much sun that day.

Deleted Comment

saghm · 13 days ago
On the other hand, if compressing to a single number is not possible, a doctor will just refuse to give a grade in that way. In my experience, most doctors tend to be very careful about trying to avoid saying anything definitive that they're not actually sure of, even if they're reasonably confident, in large part because part of their job involves understanding how patients react to how things are communicated to them. Being willing to confidently give a misleading answer to a bad question is itself a bad thing when it comes to health data, because regular people aren't able to (and shouldn't be expected to) figure out which inferences from health data happen to be feasible from a given data set.
teleforce · 13 days ago
>But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health regardless of what their wearable says.

The standard risk models for CVD based on SCORE-2 and PREVENT-like parameters are very poor, as reported in a recently published paper on their accuracy by a Swedish team [1]. As with all CVD risk stratification followed by cardiologist review, the most important accuracy measure is sensitivity (avoiding false negatives that escape review); for SCORE-2 and PREVENT it is 48% and 26%, respectively.

The paper's alternative proposal increased sensitivity to 58% by performing clustering instead of the conventional regression models used in the standard SCORE-2 (Europe) and PREVENT (US).

These types of models, including the latest proposal, performed very poorly, as shown in the paper's otherwise excellent and intuitive graphical abstract [1].

[1] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:

https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...
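For anyone unfamiliar with the term: sensitivity is simply the fraction of people who actually go on to have an event that the model flags. A minimal sketch (the counts below are made up to illustrate the 48%/26%/58% figures, they are not the paper's raw data):

  def sensitivity(true_positives, false_negatives):
      """Share of people who actually develop CVD that the model flags for review."""
      return true_positives / (true_positives + false_negatives)

  # Hypothetical cohort in which 100 people go on to have an event:
  print(sensitivity(true_positives=48, false_negatives=52))  # 0.48 -- SCORE-2-like performance
  print(sensitivity(true_positives=26, false_negatives=74))  # 0.26 -- PREVENT-like performance
  print(sensitivity(true_positives=58, false_negatives=42))  # 0.58 -- the clustering proposal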

eleveriven · 13 days ago
The problem is that the product itself invites the wrong expectation

Dead Comment

wawayanda · 14 days ago
A year or so ago, I fed my wife's blood work results into chatgpt and it came back with a terrifying diagnosis. Even after a lot of back and forth it stuck to its guns. We went to a specialist who performed some additional tests and explained that the condition cannot be diagnosed with just the original blood work and said that she did not have the condition. The whole thing was a borderline traumatic ordeal that I'm still pretty pissed about.
greenknight · 14 days ago
On the flip side, I had some pain in my chest... RUQ (right upper quadrant, for the medical folk).

On the way to the hospital, ChatGPT was pretty confident it was an issue with my gallbladder due to me having a fatty meal for lunch (but it was delicious).

After an extended wait to be seen, they didn't ask about anything like that, and at the end when they asked if there was anything else to add, I mentioned the ChatGPT / gallbladder theory... discharged 5 minutes later with suspicion of gallbladder, as they couldn't do anything that night.

Over the next few weeks, I got test after test after test to try and figure out what was going on. MRI, CT, ultrasound, etc. They all came back negative for the gallbladder.

ChatGPT was persistent. It said to get a HIDA scan, a more specialised scan. My GP was a bit reluctant but agreed. Got it, and was diagnosed with a hyperkinetic gallbladder. It is still not officially recognised as a condition, but mostly accepted. So much so that my surgeon initially said it wasn't a thing (then, after doing some research on it, said it is a thing)... and a gastroenterologist also said it wasn't a thing.

Had it taken out a few weeks ago, and it was chronically inflamed. Which means the removal was the correct path to go down.

It just sucks that your wife was on the other end of things.

tharkun__ · 14 days ago
This reminds me of another recent comment in some other post, about doctors not diagnosing "hard to diagnose" things.

There are probably ("good") reasons for this. But your own persistence, and today the help of AI, can potentially help you. The problem with it is the same problem as previously: "charlatans". Just that today the charlatan and the savior are both one and the same: The AI.

I do recognize that most people probably can't tell one from the other. In both cases ;)

You'll find this in my post history a few times now but essentially: I was lethargic all the time and got migraine-type headaches "randomly" a lot, with the feeling I'd need to puke. One time I had to stop driving as it just got so bad. I suddenly was no longer able to tolerate alcohol either.

I went to multiple doctors, was sent to specialists, who all told me that they could maaaaaybe do test XYZ but essentially: it wasn't a thing, I was crazy.

Through a lot of online research I "figured out" (and that's an over-statement) that it was something about the gut microbiome. Something to do with histamine. I tried a bunch of things; for example, I suspected it might be DAO (diamine oxidase) insufficiency. I tried a bunch of probiotics, both the "heals all your stuff" and "you need to take a single strain or it won't work" type stuff. Including "just take Actimel". Actimel gave me headaches! Turns out one of the (prominent) strains in there makes histamine. Guess what, alcohol, especially some kinds, has histamines and your "hangover" is also essentially histamines (made worse by the dehydration). And guess what else, some foods, especially some I love, contain or break down into histamines.

So I figured that somehow it's all about histamines and how my current gut microbiome does not deal well with excess histamines (through whichever source). None of the doctors I went to believed this to be a "thing" nor did they want to do anything about it. Then I found a pro-biotic that actually helped. If you really want to check what I am taking, check the history. I'm not a marketing machine. What I do believe is that one particular bacterium helped, because it's the one thing that wasn't in any of the other ones I took: Bacillus subtilis.

A soil based bacterium, which in the olden times, you'd have gotten from slightly not well enough cleaned cabbage or whatever vegetable du jour you were eating. Essentially: if your toddler stuffs his face with a handful of dirt, that's one thing they'd be getting and it's for the better! I'm saying this, because the rest of the formulation was essentially the same as the others I tried.

I took three pills per day, breakfast, lunch and dinner. I felt like shit for two weeks, even getting headaches again. I stuck with it. After about two weeks I started feeling better. I think that's when my gut microbiome got "turned around". I was no longer lethargic and I could eat blue cheese and lasagna three days in a row with two glasses of red wine and not get a headache any longer! Those are all foods that contain or make lots of histamine. I still take one per day and I have no more issues.

But you gotta get to this, somehow, through all of the bullshit people that try to sell you their "miracle cure" stuff. And it's just as hard as trying to suss out where the AI is bullshitting you.

There was exactly one doctor in my life whom I would consider good in that regard. I had already figured the above out by that time, but I was doing keto and it got all of my blood markers, except for cholesterol, into normal again. She literally "googled" with me about keto a few times, did a blood test to confirm that I was in ketosis and in general was just awesome about this. She was notoriously difficult to book and ran later than any doctor for scheduled appointments, but she took her time, and even that would not really ever have been enough to suss out the stuff that I figured out through research myself, if you ask me. While doctors are the "half gods in white", I think there's just way too much stuff and way too little time for them. It's like all the bugs at your place of work: now imagine you had exactly one doctor across a multitude of companies. Of course they only figure out the "common" ones ...

tonyhart7 · 13 days ago
after reading your comment, my perception is mixed
rubatuga · 14 days ago
If it was inflamed would your GGT level be high?
fn-mote · 14 days ago
> I fed my wife's blood work results into chatgpt and it came back with a terrifying diagnosis

I don't get it... a doctor ordered the blood work, right? And surely they did not have this opinion or you would have been sent to a specialist right away. In this case, the GP who ordered the blood work was the gatekeeper. Shouldn't they have been the person to deal with this inquiry in the first place?

I would be a lot more negative about "the medical establishment" if they had been the ones who put you through the trauma. It sounds like you put yourselves through this trauma by believing "Dr. GPT" instead of consulting a real doctor.

I will take it as a cautionary tale, and remember it next time I feed all of my test results into an LLM.

kolinko · 14 days ago
At least in Poland, I can almost always see my results before my doctor does - I get a notification that the labwork is ready and I can view results online.

Also, the regular bloodwork is around $50-$100 (for the uninsured or without a prescription), so many people just do this out of pocket once in a while and only bring it to a doctor if anything looks suspicious.

Finally, there is EU regulation about data that applies to the medical field as well - you always have the right to view all the data that any company has stored about you. Gatekeeping is forbidden by law.

vineyardmike · 14 days ago
You don't need a doctor to order bloodwork. I get a full panel done yearly, just to establish a baseline and trend. I try not to overanalyze it, and just keep it around for a professional in case some real issue arises in the future.
themafia · 14 days ago
> it stuck to its guns

It gave you a probabilistic output. There were no guns and nothing to stick to. If you had disrupted the context with enough countervailing opinion it would have "relented" simply because the conversational probabilities changed.

tstrimple · 14 days ago
I was amused but not impressed when I was able to convince Claude Code that it was useless and absolutely not a service worth paying for. It literally apologized and recommended I ask for a refund. I mean, I still get lots of value from CC. Just that it's easy to push them into whatever corner you want.
nprateem · 14 days ago
It's amazing this still needs to be said, especially here

Deleted Comment

SchemaLoad · 14 days ago
I asked a doctor friend why it seems common for healthcare workers to keep the results sheets to themself and just give you a good/bad summary. He told me that the average person can't properly understand the data and will freak themselves out over nothing.
smt88 · 13 days ago
I'm in the US and have never experienced anyone keeping results to themselves.

In fact, I can now easily access even my doctor's appointment notes. I have my entire chart unless my doctor specifically writes private notes.

worldsavior · 14 days ago
I think it's your problem you got stressed from a probabilistic machine answering with what you want to hear.
josefresco · 13 days ago
I fed about 4ish years of blood tests into an AI and after some back and forth it identified a possible issue that might signal recovery. I sheepishly brought it up with my doc, who actually said it might be worth looking into. Nothing earth shattering, just another opinion.
lugu · 14 days ago
I am sorry I have to say so, but the value of LLMs is their ability to reason based on their context. Don't use them as a smart Wikipedia (without context). For your use case, provide them with different textbooks and practice handbooks, plus the medical history of the person. Then ask your question in a neutral way. Then ask it to verify its claims in another session and provide references.

It is so unfortunate that a general chatbot designed to answer anything was the first use case pushed. I get why people are pissed.

fouc · 14 days ago
> it stuck to its guns

Everyone who encounters this needs to do a clean/fresh prompt with memory disabled to really know if the LLM is going to consistently come to the same conclusion or not.
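A minimal sketch of that check via the API, where every call starts from a fresh, memory-free context (this assumes the openai Python client; the model name and prompt are placeholders, not from the article):

  from collections import Counter
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  prompt = "Given these metrics: <paste>. Grade my heart health A-F. Reply with the letter only."

  # Each request is an independent, stateless context, so any spread is the model's own variance.
  answers = Counter()
  for _ in range(10):
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user", "content": prompt}],
      )
      answers[resp.choices[0].message.content.strip()] += 1

  print(answers)  # a wide spread (e.g. B/C/F) means low self-consistency on the same data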

pengaru · 13 days ago
> A year or so ago, I fed my wife's blood work results into chatgpt

Why would you consult a known bullshit generator for anything this important?

eleveriven · 13 days ago
Stories like yours are why I'm skeptical of these "health insight" products as currently shipped. Visualization, explanation, question-generation - great. Acting like an interpreter of incomplete medical data without a strong refusal mode is genuinely dangerous
mrguyorama · 13 days ago
>The whole thing was a borderline traumatic ordeal that I'm still pretty pissed about.

Why did you do the thing people calmly explained you should not do? Why are you pissed about experiencing the obvious and known outcome?

In medicine, even a test with "Worrying" results is rarely an actual condition requiring treatment. One reason doctors are so bad at long tail conditions is that they have been trained, both by education and literal direct experience, that chasing down test results without any symptoms is a reliable way to waste money, time, and emotions.

It's a classic statistics 101 topic to look at screening tests and notice that the majority of "positive" outcomes are false positives.
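The arithmetic behind that point, for anyone who hasn't run it (the numbers are illustrative, not from any particular test):

  def positive_predictive_value(prevalence, sensitivity, specificity):
      """P(actually has the condition | test is positive), via Bayes' rule."""
      true_pos = prevalence * sensitivity
      false_pos = (1 - prevalence) * (1 - specificity)
      return true_pos / (true_pos + false_pos)

  # A 90%-sensitive, 95%-specific screening test for a condition affecting 1 in 1,000 people:
  ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.90, specificity=0.95)
  print(f"{ppv:.1%}")  # ~1.8% -- the overwhelming majority of positives are false positives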

daveguy · 14 days ago
Please keep telling your story. This is the kind of shit that medical science has been dealing with for at least a century. When evaluating testing procedures, false positives can have serious consequences. A test that's positive every time will catch every single true positive, but it's also worthless. These LLMs don't have a goddamn clue about it. There should be consequences for these garbage fires giving medical advice.
maerF0x0 · 14 days ago
Part of the issue is taking its output as a conclusion rather than as a signal / lead.

I would never let an LLM make an amputate or not decision, but it could convince me to go talk with an expert who sees me in person and takes a holistic view.

irjustin · 14 days ago
Aren't these two sides of the same coin?

You should be happy that it's not the thing, specifically when the signs pointed towards it being "the thing"?

themafia · 14 days ago
You are _absolutely_ going to die in the next 30 minutes.

When it doesn't happen will you still be happy?

ltbarcly3 · 13 days ago
It's interesting because presumably you were too ashamed to tell the doctor "we pasted stuff into chatgpt and it said it means she is sick", because if you had said that he would have looked at the bloodwork and you could have avoided going to a specialist.

It's an interesting cognitive dissonance that you both trusted it enough to go to a specialist but not enough to admit using it.

Deleted Comment

orionsbelt · 14 days ago
> "A year or so ago"

What model?

Care to share the conversation? Or try again and see how the latest model does?

jesterson · 14 days ago
It never ceases to surprise me that people take word-salad output so seriously.

And they're probably the same people who laugh at ancient folks for carefully listening to shamans.

bigbuppo · 14 days ago
Why not just ask WebMD?
port11 · 13 days ago
You used a predictive/statistical proximity chatbot on a single point-in-time snapshot of her blood, and you’re pissed that the result wasn’t useful? I think any decent GP would push back, want to see trends in the data, or at least look at the broader context.

I mean, at some point we have to admit that LLMs aren’t designed for correctness but utility.

terribleperson · 14 days ago
Do you have a custom prompt/personality set? What is it?
ltbarcly3 · 13 days ago
Yea, if only he had said "make sure you are always honest" first!
filoeleven · 13 days ago
Gotta love the replies to this. At least more of the botheads are now acting like they're trying to ask helpful questions instead of just flat out saying "you're using it wrong."
gizajob · 14 days ago
You’re pissed about your own stupidity? In asking for deep knowledge and medical advice from a Markov chain?
freedomben · 14 days ago
> Despite having access to my weight, blood pressure and cholesterol, ChatGPT based much of its negative assessment on an Apple Watch measurement known as VO2 max, the maximum amount of oxygen your body can consume during exercise. Apple says it collects an “estimate” of VO2 max, but the real thing requires a treadmill and a mask. Apple says its cardio fitness measures have been validated, but independent researchers have found those estimates can run low — by an average of 13 percent.

There's plenty of blame to go around for everyone, but at least for some of it (such as the above) I think the blame rests more on Apple for falsely representing the quality of their product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).

What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it. Even disregarding statistical outliers, it's not at all clear what part of the data is "good" vs "unreliable", especially when the company that collected that data claims that it's good data.

brandonb · 14 days ago
FWIW, Apple has published validation data showing the Apple Watch's estimate is within 1.2 ml/kg/min of a lab-measured Vo2Max.

Behind the scenes, it's using a pretty cool algorithm that combines deep learning with physiological ODEs: https://www.empirical.health/blog/how-apple-watch-cardio-fit...

itchyouch · 14 days ago
The trick with the VO2 max measurement on the Apple Watch, though, is that the person cannot waste any time during their outdoor walk and needs to maintain a brisk pace.

Then there are confounders like altitude and elevation gain that can sully the numbers.

It can be pretty great, but it needs a bit of control in order to get a proper reading.

ignoramous · 14 days ago
The paper itself: https://www.apple.com/healthcare/docs/site/Using_Apple_Watch...

Seems like Apple's 95% accuracy estimate for VO2 max holds up.

  Thirty participants wore an Apple Watch for 5-10 days to generate a VO2 max estimate. Subsequently, they underwent a maximal exercise treadmill test in accordance with the modified Åstrand protocol. The agreement between measurements from Apple Watch and indirect calorimetry was assessed using Bland-Altman analysis, mean absolute percentage error (MAPE), and mean absolute error (MAE).

  Overall, Apple Watch underestimated VO2 max, with a mean difference of 6.07 mL/kg/min (95% CI 3.77–8.38). Limits of agreement indicated variability between measurement methods (lower -6.11 mL/kg/min; upper 18.26 mL/kg/min). MAPE was calculated as 13.31% (95% CI 10.01–16.61), and MAE was 6.92 mL/kg/min (95% CI 4.89–8.94).

  These findings indicate that Apple Watch VO2 max estimates require further refinement prior to clinical implementation. However, further consideration of Apple Watch as an alternative to conventional VO2 max prediction from submaximal exercise is warranted, given its practical utility.
https://pmc.ncbi.nlm.nih.gov/articles/PMC12080799/
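For reference, those agreement statistics are straightforward to compute from paired watch vs. lab readings; a minimal sketch with toy numbers (not the study's data):

  import statistics

  def agreement_stats(watch, lab):
      """Bland-Altman mean difference and limits of agreement, plus MAE and MAPE."""
      diffs = [l - w for w, l in zip(watch, lab)]  # lab minus watch: positive = watch underestimates
      mean_diff = statistics.mean(diffs)
      sd = statistics.stdev(diffs)
      limits = (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)
      mae = statistics.mean(abs(d) for d in diffs)
      mape = 100 * statistics.mean(abs(l - w) / l for w, l in zip(watch, lab))
      return mean_diff, limits, mae, mape

  # Toy example: five paired VO2 max readings (mL/kg/min)
  watch = [32.0, 38.5, 41.0, 29.0, 45.0]
  lab = [36.0, 44.0, 46.5, 35.0, 52.0]
  print(agreement_stats(watch, lab))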

aeonfox · 14 days ago
> I think the blame more rests on Apple for falsely representing the quality of their product

There was plenty of other concerning stuff in that article. And from a quick read it wasn't suggested or implied that the VO2 max issue was the deciding factor for the original F score the author received. The article did suggest many times over that ChatGPT is really not equipped for the task of health diagnosis.

> There was another problem I discovered over time: When I tried asking the same heart longevity-grade question again, suddenly my score went up to a C. I asked again and again, watching the score swing between an F and a B.

brandonb · 14 days ago
The lack of self-consistency does seem like a sign of a deeper issue with reliability. In most fields of machine learning robustness to noise is something you need to "bake in" (often through data augmentation using knowledge of the domain) rather than get for free in training.
freedomben · 13 days ago
> There was plenty of other concerning stuff in that article.

Yeah for sure, I probably didn't make it clear enough but I do fault OpenAI for this as much as or maybe more than Apple. I didn't think that needed to be stressed since the article is already blasting them for it and I don't disagree with most of that criticism of OpenAI.

AndrewKemendo · 14 days ago
> Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.

Yes. You, and every other reasoning system, should always challenge the data and assume it’s biased at a minimum.

This is better described as “critical thinking” in its formal form.

You could also call it skepticism.

That impossibility of drawing conclusions assumes there’s a correct answer and is called the “problem of induction.” I promise you a machine is better at avoiding it than a human.

Many people freeze up or fail with too much data - put someone with no experience in front of 500 ppl to give a speech if you want to watch this live.

freedomben · 13 days ago
I mostly agree with you, but I think it's important to consider what you're doing with the data. If we're doing rigorous science, or making life-or-death decisions on it, I would 100% agree. But if we're an AI chatbot trying to offer some insight, with a big disclaimer that "these results might be wrong, talk to your doctor", then I think that's quite overkill. The end result would be no (potential) insight at all and no chance of ever improving, since we'll likely never get to a point where we could fully trust the data. Not even the best medical labs are always perfect.
miltonlost · 14 days ago
> What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.

Well, I would expect the AI to provide the same response as a real doctor did from the same information. Which, as the article went over, the doctors were able to do.

I also would expect the AI to provide the same answer every time for the same data, unlike what it did (from F to B over multiple attempts in the article).

OpenAI is entirely to blame here when they are putting out faulty products (hallucinations even on accurate data are their fault).

jdub · 14 days ago
Why do you have those expectations?
jayd16 · 14 days ago
Well if it doesn't know the quality of the data and especially if it would be dangerous to guess then it should probably say it doesn't have an answer.
freedomben · 13 days ago
I don't disagree, but that reinforces my point above I think. If AI has to assume the data is of poor quality, then there's no point in even trying to analyze it. The options are basically:

1. Trust the source of the data to be honest about its quality

Or

2. Distrust the source

Approach number 2 basically means we can never do any analysis on it.

Personally I'd rather have a product that might be wrong than none at all, but that's a personal preference.

hmokiguess · 14 days ago
I have been sitting and waiting for the day these trackers get exposed as just another health fad that is optimized to deliver shareholder value and isn't serious enough for medical-grade applications.
NoPicklez · 14 days ago
I don't see how they are considered a health fad, they're extremely useful and accurate enough. There are plenty of studies and real world data showing Garmin VO2Max readings being accurate to 1-2 points different to a real world test.

There is this constant debate about how accurately VO2max is measured, and it's highly dependent on actually doing exercise to determine your VO2max using your watch. But yes, if you want a lab/medically precise measure you need to do a test that measures your actual oxygen uptake.

cthalupa · 13 days ago
I'll preface this with I generally trust doctors. I think on the whole they are well positioned to provide massive benefit to their patients.

I will also preface this with saying I do not think any LLM is better than the average doctor and that you are far better served going to your doctor than asking ChatGPT what your health is like on any factor.

But I'll also say that the quality of doctors varies massively, and that a good amount of doctors learn what they learn in school and do not keep up with the latest advances in research, particularly those that have broad spectrums such as GPs. LLMs that search scientific literature, etc., might point you in the direction of this research that the doctors are not aware of. Or hallucinate you into having some random disease that impacts 3 out of every million people and send you down a rabbithole for months.

Unfortunately, it's difficult to resolve this without extremely good insurance or money to burn. The depth you get and the level of information that a good preventative care cardiologist has is just miles ahead of where your average family medicine practitioner is at. Statins are an excellent example - new prescriptions for atorvastatin are still insanely high despite it being a fairly poor choice in comparison to rosuvastatin or pitavastatin for a good chunk of the people on it. They often are behind on the latest recommendations from the NLA and AHA, etc.

There's a world where LLMs or similar can empower everyday people to talk to their doctor about their options and where they stand on health, where they don't have to hope their doc is familiar with where the science has shifted over the past 5-10 years, or cough up the money for someone who specializes in it. But that's not the world of today.

In the meantime, I do think people should be comfortable being their own advocates with their doctors. I'm lucky enough that my primary care doc is open to reading the studies I send over to him on things and working with me. Or at least patient enough to humor me. But it's let me get on medications that treat my symptoms without side effects and improved my quality of life (and hopefully life/healthspan). There's also been things I've misinterpreted - I don't pick a fight with him if we come to opposite conclusions. He's shown good faith in agreeing with me where it makes sense to me, and pushed back where it hasn't, and I acknowledge he's the expert.

port11 · 13 days ago
I interviewed for Ada, whose ML diagnostic tool had shown itself more accurate at diagnosis than a panel of doctors. It was specifically trained on case data, IIRC, and doctors were paid to help improve the results.

I wonder what it’s like now. Any time I use it for a diagnosis I get outlandish results, and then I’ll head to my GP and turns out it was something rather simple.

biophysboy · 13 days ago
I think the fairest test is: what is the best and fastest way to reduce medical uncertainty? For rare ailments with a single cause and exclusive symptoms, that can be accurately described with simple language (no medical jargon), it's possible that an LLM is better than a doctor.

For more ambiguous situations where you need actual tests, I am skeptical of using LLMs.

cameldrv · 14 days ago
I dunno, if the Apple Watch said he had a VO2 max of 30, that probably means he can’t run a mile in less than 12 minutes or so. He’s probably not at all healthy…
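That ballpark can be sanity-checked with the old Cooper-test relationship between 12-minute run distance and VO2 max (a rough approximation I'm assuming here, and individual variation is large):

  def cooper_distance_m(vo2max):
      """Cooper test, inverted: approximate distance (metres) covered in an all-out 12-minute run."""
      return vo2max * 44.73 + 504.9

  MILE_M = 1609.34
  d = cooper_distance_m(30)
  mile_minutes = 12 * MILE_M / d
  print(f"{d:.0f} m in 12 minutes, i.e. roughly a {mile_minutes:.1f}-minute mile at a maximal effort")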
smcl · 14 days ago
Apple Watch is pretty poor at estimating VO2 max and it seems to be more correlated with how often you record exercises with said watch than with your actual health. For example I watched mine climb slowly as I prepared for my football season (beyond 50), then after the season started I ended up playing and training just as frequently but without wearing the watch. After a few weeks (of me training and playing hard) during my next run it recorded a sharp decline in my VO2 max (43-44ish iirc). When I started wearing it during training - you're not permitted during matches - it recorded me having a slow return to condition, without any changes to my routine.

That said if it's showing someone as having 30 I don't imagine they're going to be in spectacular condition

port11 · 13 days ago
I really don’t know whether to trust that specific measurement. When I was a very active runner and doing intervals to improve per-km time, my VO2max went from 38 to 42. I decided to do a professional VO2max test and got a 46.

Now, 2 years later, I don’t run due to injury and a kid, and it’s resting at 34. For reference, when I went to the gym almost every day and ran once or twice a week, the value was 32.

I don’t get much utility out of it, even looking at the trends. Not sure what Apple is doing behind the scenes to get the score.

eleveriven · 13 days ago
This is really more of an "outdoor run while wearing the watch" proxy than a true fitness measure
Someone · 14 days ago
> he had a vo2max of 30, that probably means he can’t run a mile in less than 12 minutes or so. He’s probably not at all healthy…

Health and fitness correlate but are different things. VO2max is more about fitness than about health.

Also, looking at https://en.wikipedia.org/wiki/VO2_max#Reference_values, 30 is about average for men in their 40s/50s, which, from a quick google, I estimate is the author’s age range.

FeteCommuniste · 13 days ago
> Also, looking at https://en.wikipedia.org/wiki/VO2_max#Reference_values, 30 is about average for men in their 40s/50s, which, from a quick google, I estimate is the author’s age range.

And the average man in his 40s or 50s is in...not especially good aerobic shape.

netdevphoenix · 13 days ago
Fitness correlates with health though. Just because you don't have any conditions does not mean that you are healthy. And inability to meet certain fitness tests is correlated with lower health.
danielmarkbruce · 13 days ago
This is a silly take. VO2 max is one of the strongest predictors of all cause mortality. Various large scale studies have shown it to be true.
akshivb · 13 days ago
I had a "below average" VO2 max score based on my Apple Watch measurements. It was ~40 mL/kg/min, in the span of about a month it jumped up to 53 mL/kg/min, which is "high" for my age group. So what happened? I started running instead of cycling as my primary form of cardio.

My hypothesis is that the Apple Watch estimates higher if you are running rather than pedaling. I definitely don't think my cardiovascular fitness went from poor to great over a month. It seems more likely that it was maybe underestimating, and perhaps now is overestimating.

mdtancsa · 13 days ago
After a long injury, I got back to slowly running on the treadmill/bike/elliptical at the gym. IIRC, my Garmin qualified its VO2Max results by saying I needed to run outside for some period of time to get a more accurate measurement. I guess there is something about the running metrics it collects that has a smaller error range.
wincy · 13 days ago
Yeah I just ignore it, when I was biking 40+ miles a week this summer it says my VO2 max was 18, which is just absurd. Maybe because my arm is really hairy I don’t know.
dgxyz · 14 days ago
If Apple watch said anything about that it's probably wrong. It can't accurately measure VO2 max.

Incidentally I got rid of mine recently. It is bliss not having one.

Also VO2 max is a crappy measure of fitness. I apparently had "average" VO2 max after a treadmill test. I can hike 50km with a 2km elevation gain in one go and not die. People I know with higher VO2 max dropped out.

evandijk70 · 13 days ago
During a 50 km hike you are not anywhere close to your VO2 max, so it makes sense that the VO2 max is not predictive for that distance.
lurking_swe · 13 days ago
You’re not wrong. However - the Health app on the iPhone (where you can view your health data) makes this VERY clear. Most people just don’t read.

I’ll quote:

“This is a measurement of your VO2 max, which is the maximum amount of oxygen your body can consume during exercise. Also called cardiorespiratory fitness, this is a useful measurement for everyone from the very fit to those managing illness.

A higher VO2 max indicates a higher level of cardio fitness and endurance.

Measuring VO2 max requires a physical test and special equipment. You can also get an estimated VO2 max with heart rate and motion data from a fitness tracker. Apple Watch can record an estimated VO2 max when you do a brisk walk, hike, or run outdoors.

VO2 max is classified for users 20 and older. Most people can improve their VO2 max with more intense and more frequent cardiovascular exercise. Certain conditions or medications that limit your heart rate may cause an overestimation of your VO2 max. VO2 max is not validated for pregnant users. You can indicate you're taking certain medications or add a current pregnancy in Health Details.”

bwv848 · 13 days ago
> hike 50km with a 2km elevation gain in one go and not die.

And thru-hikers can do this for days. It’s more related to fatigue resistance, mitochondrial density, and walking efficiency. But VO2 max still matters in high-intensity sports, you can’t ignore it when you’re pedaling a bike at high Zone 4 in a race.

danielmarkbruce · 13 days ago
vo2 max is one of the strongest predictors of all cause mortality.
mr_toad · 13 days ago
Compared to the average patient a typical GP sees, someone who can actually run a mile is probably doing pretty well.
smt88 · 13 days ago
This is certainly true in the US, but I don't think it's universal at all
sinuhe69 · 14 days ago
My general take on any AI/ML in medicine is that without proper clinical validation, they are not worth trying. Also, AI Snake Oil is worth reading.
rubatuga · 14 days ago
Clinical validation, proper calibration, ethnic and community and population variants, questioning technique and more ...
joelthelion · 14 days ago
Exactly. There's a lot of potential, but it needs to be done right, otherwise it is worse than useless.
seemaze · 14 days ago
I can't wait until it starts recommending signing me up for an OpenAI personalized multi-vitamin® subscription
meindnoch · 13 days ago
"You're absolutely right! I was mistaken about mercury and lead being essential minerals, and adding them to your supplements. Sorry about that!"
seemaze · 11 days ago
I was casually browsing for a health monitor when I came across Ultrahuman Blood Vision - one of the key features being AI Clinical Summary with Supplement Report.. it seems my sarcasm was late to the party.
elzbardico · 14 days ago
LLMs are not a mythical universal machine learning model that you can feed any input and have it magically do the same thing a specialized ML model could do.

You can't feed an LLM years of time-series meteorological data, and expect it to work as a specialized weather model, you can't feed it years of medical time-series and expect it to work as a model specifically trained, and validated on this specific kind of data.

An LLM generates a stream of tokens. If you feed it a giant set of CSVs and it wasn't RL'd to do something useful with them, it will just try to make whatever sense of them it can and generate something that most probably has no strong numerical relationship to your data; it will simulate an analysis, not actually perform one.

You may have a giant context window, but attention is sparse; the attention mechanism doesn't see your whole data at the same time. It can do some simple comparisons, like figuring out that if I say my current pressure is 210x180 I should call an ER immediately. But once I send it a time-series of my twice-a-day blood-pressure measurements for the last 10 years, it can't make any real sense of it.

Indeed, it would have been better for the author to ask the LLM to generate a python notebook to do some data analysis on it, and then run the notebook and share the result with the doctor.
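Something like the sketch below is what that would look like - deterministic summary statistics the LLM (or the doctor) can reason over, instead of raw rows crammed into the context window (the column names are an assumption about a hypothetical export):

  import pandas as pd

  # Hypothetical export with columns: date, systolic, diastolic (two readings per day)
  bp = pd.read_csv("blood_pressure.csv", parse_dates=["date"])

  monthly = (
      bp.set_index("date")
        .resample("M")[["systolic", "diastolic"]]
        .agg(["mean", "max", "count"])
        .round(1)
  )

  # Change in monthly mean systolic pressure versus the same month a year earlier
  yearly_trend = monthly[("systolic", "mean")].diff(12).dropna()

  print(monthly.tail(12))     # the last year, month by month
  print(yearly_trend.tail())  # is the trend drifting up or down?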

rfw300 · 14 days ago
This is true as a technical matter, but this isn't a technical blog post! It's a consumer review, and when companies ship consumer products, the people who use them can't be expected to understand failure modes that are not clearly communicated to them. If OpenAI wants regular people to dump their data into ChatGPT for Health, the onus is on them to make it reliable.
themafia · 14 days ago
> the onus is on them to make it reliable.

That is not a plausible outcome given the current technology or of any of OpenAI's demonstrated capabilities.

"If Bob's Hacksaw Surgery Center wants to stay in business they have to stop killing patients!"

Perhaps we should just stop him before it goes too far?

Deklomalo · 13 days ago
You state a lot of things without testing them first?

An LLM has structures in its latent space which allow it to do basic math; it has also seen enough data that it probably has structures for detecting basic trends.

An LLM doesn't just generate a stream of tokens. It generates an embedding and searches/does something in its latent space, then returns tokens.

And you don't even know what LLM interfaces do in the background. Gemini creates sub-agents. There could easily already be a 'trend detector'.

I even did a test: I generated random data with a trend and fed it to ChatGPT. The output was very coherent and right.

elzbardico · 13 days ago
That's not how it works.
protocolture · 14 days ago
This LLM is advertising itself in a medical capacity. You aren't wrong, but the customer has been fed the wrong set of expectations. It's the fault of the marketing of the tool.