While there are prior associations between retinal photos and autism, I'm by default very skeptical of any AI algorithm purporting "100% accuracy". It smells like data leakage.
I would bet that even physicians aren't 100% consistent in their diagnosis of autism. If that's the case, then it should be more or less impossible for any other diagnostic approach to be 100% consistent with the physician diagnoses.
Edit: After reading the study closer, this criticism might be a bit harsh. In their autism subjects, they excluded those with mild/moderate autism. Limiting to severe cases should mean there's a higher degree of confidence/consistency in the diagnoses.
I'm not actually sure where the Petapixel article authors are getting the phrase "100% accuracy", as it does not show up in the article they are writing about, nor does it appear in the only other article they link to. Putting it in quotes makes it seem like a claim the model creators are making about their model, but I don't see them making that claim in general. They say the model matched all the sample data in this case, not that it's 100% accurate—presumably for the same reasons you are hesitant to do so. Unless I'm missing something, Petapixel should correct their headline.
> I would bet that even physicians aren't 100% consistent in their diagnosis of autism.
That's because autism is diagnosed by using the DSM. You can take an x-ray of an arm and see the fracture, but in order to diagnose autism you have to determine 'persistent deficits in social communication and social interaction across multiple contexts'.
It is all dependent on how society defines things, and is fluid (and IMO, somewhat dubious).
I try not to immediately call BS on these types of studies…but in this case there are some concerns.
“The data sets were randomly divided into training (85%) and test (15%) sets. We used 10-fold cross-validation to obtain generalized results of model performance. Data splitting was performed at the participant level and stratified based on the outcome variables. Because the data classes were imbalanced for symptom severity (ADOS-2 and SRS-2), we performed a random undersampling of the data at the participant level before conducting data splitting. Moreover, we examined different split ratios (80:20 and 90:10) to assess the robustness and consistency of the predictive performances across diverse splitting proportions.”
* undersampling is problematic here and probably introduced some bias. These imbalanced class problems are just plain hard. Claiming one hundred percent on an imbalanced class problem should probably cause some concern.
* data split at the participant level has to be done really carefully or you’ll overfit
* multiple comparisons bias by testing multiple split ratios on the same test data. Same with the 10-fold cross-validation.
* not sure if they validated results on any external test data
* outcome variable stratification also has to be done really carefully or it will introduce bias; seems particularly sensitive in this case
* using severity of symptoms as class labels is problematic. These have to really have been diagnosed the same way / consistently to be meaningful.
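To make the participant-level point concrete, here's a minimal sketch (toy data, not from the paper) of a split that stratifies by outcome while keeping both of a participant's eye images on the same side:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 100 participants, two retinal images (one per eye) each,
# with one diagnosis label per participant.
participants = np.repeat(np.arange(100), 2)
labels_per_participant = rng.integers(0, 2, 100)
y = np.repeat(labels_per_participant, 2)

# Split at the PARTICIPANT level, stratified by outcome, so both images
# from one person land on the same side of the split.
test_ids = []
for cls in (0, 1):
    ids = np.flatnonzero(labels_per_participant == cls)
    rng.shuffle(ids)
    test_ids.extend(ids[: int(0.15 * len(ids))])  # ~15% test, per class

test_mask = np.isin(participants, test_ids)
train_mask = ~test_mask

# No participant may contribute images to both sets.
assert set(participants[train_mask]).isdisjoint(participants[test_mask])
```

If you split at the image level instead, the model can memorize a person from their left eye and "predict" them from their right eye, which looks like accuracy but is leakage.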
I also note a long history in the collection of these images (15 years, iirc). Hard to believe such a diverse set of images (collection, equipment, etc.) led to perfect results.
ML issues aside, super interested in the basic medical concept. I wasn’t aware retinal abnormalities could be indicative of issues like ASD.
> The photography sessions for patients with ASD took place in a space dedicated to their needs, distinct from a general ophthalmology examination room. This space was designed to be warm and welcoming, thus creating a familiar environment for patients. Retinal photographs of typically developing (TD) individuals were obtained in a general ophthalmology examination room. Each eye required an average of 10–30 s for photography, although some cases involved longer periods to help the patient calm down, sometimes exceeding 5–10 min. All images were captured in a dark room to optimize their quality. Retinal photographs of both patients with ASD and TD were obtained using non-mydriatic fundus cameras, including EIDON (iCare), Nonmyd 7 (Kowa), TRC-NW8 (Topcon), and Visucam NM/FA (Carl Zeiss Meditec).
So three questions:
1. Are we positive that the difference in rooms does not affect these images?
2. If we are in a dark room, and ASD patients are in it for 5-10 minutes longer, are we sure this doesn't affect the retina?
3. Were all cameras used for both ASD and TD images?
Want to make sure the AI is being trained to detect autism, and wasn't accidentally trained to identify camera models, length-in-dark-room or room-welcomingness.
Hopefully not, but I assume you have to be so careful with these sort of things when the model is entirely black-box and you can't actually validate what it's actually doing inside.
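One cheap sanity check along these lines (a hypothetical sketch; the confound here is simulated, not taken from the study): see whether acquisition metadata alone, e.g. the camera model, already predicts the label. If it does, any image model can shortcut through it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical metadata: camera model per image, with a simulated confound
# (one camera used mostly for one group) plus 5% label noise.
cameras = rng.choice(["EIDON", "Nonmyd 7", "TRC-NW8", "Visucam"],
                     size=500, p=[0.7, 0.1, 0.1, 0.1])
label = (cameras == "EIDON").astype(int)
flip = rng.random(500) < 0.05
label = np.where(flip, 1 - label, label)

# "Classifier" that only sees the camera: majority label per model,
# evaluated in-sample, which is all a leak needs to show up.
correct = 0
for cam in np.unique(cameras):
    mask = cameras == cam
    majority = int(label[mask].mean() >= 0.5)
    correct += int((label[mask] == majority).sum())
acc = correct / len(label)
print(f"camera-only accuracy: {acc:.2f}")  # far above chance => leakage suspect
```

The same check works for room, time-in-dark-room bins, or acquisition date: if metadata alone beats chance, the image model's score is suspect.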
This is definitely worthy of concern. There's an infamous case where an AI was trained to detect cancer from imaging, but all the positive examples included a ruler (to measure the tumor), so it turned out it was just good at detecting rulers.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9674813/#:~:tex....
Reminds me of the classic apocryphal early ML story of the enemy tank detector that was 100% accurate at identifying camouflaged tanks… so long as tanks and sunny weather were perfectly correlated in the input data, just as they were in the training data.
It appears they also report good results for predicting symptom severity. It's less obvious how the cameras, etc., would leak into severity. Unless it actually works (it does seem a bit too good to be true), I'm thinking the test set was in the base model or something.
Came here to say this. 100% is too good to be true, and it's almost certain the AI has figured out a signal leak from the camera, image format, room, etc.
Also concussions, according to the article, which is news to this retired neurosurgical anesthesiologist. (38 years in practice; stopped in 2015 at age 67 because I believed [still do] it's better to retire [from my profession, at least] too early than too late.)
There's a famous story (probably apocryphal) about the military of a powerful nation training an early AI to find pictures of submarines beneath the sea.
There was great excitement as it was near 100%.
It later transpired the pictures with submarines in had a white border.
Ha, I've heard a similar story about diagnosing skin cancer from pictures of moles. They were really excited about the performance of the model but it turned out if the dermatologist was concerned about the size of a mole they would include a ruler in the picture to document the size. The NN wasn't trained to "diagnose skin cancer" it was trained to recognize rulers in pictures.
Haven't modern model architectures gotten better at avoiding this kind of overfitting? Like, obviously data quality is still very important, but my understanding is that dropout mitigates this by randomly cutting out these unwanted feature channels: the models learn to distinguish all differences, rather than just one, or fixed combinations of several.
It really doesn't have to do with most ML architectures. It has to do with experiment design. If some data used in testing is part of the training process, there will be overfitting. That's why a final test set is required for unbiased evaluation.
> haven't modern model architectures gotten better at avoiding this kind of overfitting?
Overfitting is, AIUI, a training method and data issue, not a model issue alone. I doubt any model is resistant to overfitting if you give it data where the answer is reliably encoded some aspect it can use but outside of what you want it to look at.
Now, you can notice suspicious results and investigate (or you can just publish a 100% success rate and call it a day.)
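A toy illustration of that failure mode, with a made-up "white border" style leak column appended to weak genuine features (simulated data, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, n)

# Genuine signal: weak and noisy.
X_signal = 0.3 * y[:, None] + rng.normal(size=(n, 5))
# Leaked artifact: a flag perfectly correlated with the label
# (the "white border" / ruler of the anecdotes above).
X = np.hstack([X_signal, y[:, None].astype(float)])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(clf.score(Xte, yte))  # near-perfect: the model keys on the leaked column
```

No held-out split catches this, because the leak is present on both sides; only inspecting what the model uses (or removing the artifact) does.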
Am I reading this wrong? They only had validation sets with no final test set, which makes the results kinda worthless, because we don't know how overfit they were to these validation sets (which can easily happen with any sort of parameter tuning). There's a reason a proper study needs three splits: train and validation (possibly multiple of these if you use k-fold), plus a final 'test' set to be used as sparingly as possible.
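For what it's worth, that three-way protocol can be sketched like this (toy data; the split proportions are illustrative, not the paper's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, 1000)

# Carve off a held-out test set first, to be touched exactly once at the end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Split the remainder into train and validation for model/parameter tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15, stratify=y_rest, random_state=0)

# Tune against (X_val, y_val); report a single final number on (X_test, y_test).
```

Every tuning decision made while looking at the validation score leaks a little information into it; the test set stays honest only if it plays no part in tuning.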
See the paper, "On estimating model accuracy with repeated cross-validation"
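For reference, repeated cross-validation along those lines is a one-liner in scikit-learn (toy data via `make_classification`); the point is the spread of scores, not any single number:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# 50 fold-scores; the spread matters as much as the mean.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} "
      f"over {len(scores)} folds")
```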
I wonder if physiognomy will come back as a field, if AI scans like these have any validity.
I remember stumbling upon multiple esoteric accounts on both Twitter and TikTok with communities seemingly obsessed with characterising various psychological traits purely from facial features, importantly without racial undertones.
While this sounds ridiculous on the surface and has various horrible historical echoes, I've always had a hunch there was actually something to it, from a purely intuitive perspective and from knowing lots of people. Again, very importantly, this disregards anything about race and instead focuses on the myriad of hormone-linked features, neurotypicality, alcohol, environmental factors: whatever traits seemingly go "across races".
https://youtu.be/6JPgpasgueQ?si=dn3muYeOe-cSSKM2
Just being in a dark room longer is sufficient to make changes that an AI could pick up on.
Ideally, they should capture the images from children before diagnosis, then see if they can predict the diagnosis.
Even the sklearn docs for cross validation show this split: https://scikit-learn.org/stable/_images/grid_search_cross_va...
Or maybe there's nothing there.