Readit News
panabee commented on Oral Microbes Linked to 3-Fold Increased Risk of Pancreatic Cancer   nyulangone.org/news/oral-... · Posted by u/bmau5
panabee · 3 months ago
The association between pathogens and cancer is under-appreciated, mostly due to limitations in detection methods.

For instance, it is not uncommon for cancer studies to design assays around non-oncogenic strains, or for assays to use primer sequences with binding sites mismatched to a large number of NCBI GenBank genomes.
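
A toy check along these lines (the primer and file name are hypothetical; only exact matches against locally downloaded GenBank FASTA records are considered):

    import re

    IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
             "S": "[GC]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
             "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
    COMP = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

    def read_fasta(path):
        # minimal FASTA parser: {accession: sequence}
        seqs, name = {}, None
        for line in open(path):
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name:
                seqs[name].append(line.strip().upper())
        return {k: "".join(v) for k, v in seqs.items()}

    primer = "GGTTGGAGYACCTCYCTTGG"                # hypothetical degenerate primer
    pattern = re.compile("".join(IUPAC[b] for b in primer))
    genomes = read_fasta("genbank_strains.fasta")  # hypothetical local GenBank download

    misses = [acc for acc, seq in genomes.items()
              if not (pattern.search(seq) or pattern.search(seq.translate(COMP)[::-1]))]
    print(f"{len(misses)}/{len(genomes)} genomes lack an exact binding site for {primer}")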

Another example: studies relying on The Cancer Genome Atlas (TCGA), a rich database for cancer investigations. The TCGA made a deliberate tradeoff: standardizing quantification of eukaryotic coding transcripts at the cost of excluding non-poly(A) transcripts like EBER1/2 and other viral non-coding RNAs -- thus potentially understating viral presence.
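
A concrete way to see this: a poly(A)-based quantification table simply has no rows for those features, so zero signal there is structural, not biological. Tiny sketch (file name and columns are hypothetical):

    import csv

    VIRAL_NCRNAS = {"EBER1", "EBER2"}   # EBV non-coding RNAs, not polyadenylated

    with open("tcga_sample_gene_tpm.tsv") as fh:
        quantified = {row["gene_name"] for row in csv.DictReader(fh, delimiter="\t")}

    for gene in sorted(VIRAL_NCRNAS - quantified):
        print(f"{gene}: never quantified -- absence of evidence, not evidence of absence")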

Enjoy the rabbit hole. :)

panabee commented on Are elites meritocratic and efficiency-seeking? Evidence from MBA students   arxiv.org/abs/2503.15443... · Posted by u/bikenaga
panabee · 3 months ago
A more accurate title: "Are Cornell Students Meritocratic and Efficiency-Seeking? Evidence from 271 MBA students and 67 Undergraduate Business Students."

This topic is important and the study interesting, but the methods exhibit the same generalizability bias as the famous Dunning-Kruger study.

The referenced MBA students -- and by extension, the elites -- only reflect 271 students across two years, all from the same university.

By analyzing biased samples, we risk misguided discourse on a sensitive subject.

@dang

panabee commented on AI groups spend to replace low-cost 'data labellers' with high-paid experts   ft.com/content/e17647f0-4... · Posted by u/eisa01
jacobr1 · 5 months ago
Not an answer, but a contributory idea: meta-analysis. There are plenty of strong meta-analyses out there, and one of the things they tend to do is weigh the methodological rigour of each paper along with its overlap with the combined question being analyzed. Could we use this weighting explicitly in the training process?
panabee · 5 months ago
Thanks. This is helpful. Looking forward to more of your thoughts.

Some nuance:

What happens when the methods are outdated/biased? We highlight a potential case in breast cancer in one of our papers.

Worse, who decides?

To reiterate, this isn't to discourage the idea. It's a good one and should be considered, but it doesn't escape (yet) the core issue of when something becomes a "fact."
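
For what it's worth, a minimal sketch of what explicit rigor weighting could look like in a training loss (all scores and data below are hypothetical placeholders; PyTorch):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    model = torch.nn.Linear(16, 2)                 # stand-in classifier
    x = torch.randn(8, 16)                         # 8 toy examples drawn from 8 "papers"
    y = torch.randint(0, 2, (8,))                  # toy labels
    rigor = torch.tensor([0.9, 0.2, 0.7, 0.5, 1.0, 0.1, 0.6, 0.8])  # hypothetical rigor scores

    per_example = F.cross_entropy(model(x), y, reduction="none")
    loss = (rigor * per_example).sum() / rigor.sum()   # rigor-weighted mean loss
    loss.backward()

The mechanics are easy; the hard part is the same as above -- the weights are only as good as whoever scored the methods.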

panabee commented on AI groups spend to replace low-cost 'data labellers' with high-paid experts   ft.com/content/e17647f0-4... · Posted by u/eisa01
briandear · 5 months ago
> this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth

Are scientists not writing those papers? There may be bad incentives, but scientists are responding to those incentives.

panabee · 5 months ago
Valid critique, but one addressing a problem above the ML layer, at the human layer. :)

That said, your comment has an implication: in which fields can we trust data if incentives are poor?

For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?

These are hard questions.

ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.

Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."

panabee commented on AI groups spend to replace low-cost 'data labellers' with high-paid experts   ft.com/content/e17647f0-4... · Posted by u/eisa01
bjourne · 5 months ago
What if there is significant disagreement within the medical profession itself? For example, isotretinoin is prescribed for acne in many countries, but in other countries the drug is banned or access is restricted due to adverse side effects.
panabee · 5 months ago
If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.

Every fact is born an opinion.

This challenge exists in most, if not all, spheres of life.

panabee commented on AI groups spend to replace low-cost 'data labellers' with high-paid experts   ft.com/content/e17647f0-4... · Posted by u/eisa01
empiko · 5 months ago
This is true for every subfield I have worked on for the past 10 years. The dirty secret of ML research is that Sturgeon's law applies to datasets as well -- 90% of the data out there is crap. I have seen NLP datasets with hundreds of citations that were obviously worthless as soon as you put in the "effort" and actually looked at the samples.
panabee · 5 months ago
100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.

(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)

panabee commented on AI groups spend to replace low-cost 'data labellers' with high-paid experts   ft.com/content/e17647f0-4... · Posted by u/eisa01
panabee · 5 months ago
This is long overdue for biomedicine.

Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.

Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.

We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.

If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.

Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.

Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.

panabee · 5 months ago
To elaborate, errors go beyond data and reach into model design. Two simple examples:

1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical. Differences may alter gene expression -- akin to "polish" vs. "Polish". (A toy sketch after this list illustrates the point.)

2. Sickle-cell anemia and other diseases are linked to nucleotide differences. These single nucleotide polymorphisms (SNPs) mean hard attention for DNA matters and single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry these assumptions into healthcare.
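
A toy illustration of point 1, and of why the single-base resolution in point 2 matters; the encoding is hypothetical and real pipelines vary:

    VOCAB_COLLAPSED = {"A": 0, "C": 1, "G": 2, "T": 3}
    VOCAB_AWARE     = {"A": 0, "C": 1, "G": 2, "T": 3, "m": 4}   # "m" = 5-methylcytosine

    def tokenize(seq, vocab, collapse_mods):
        # map each base to a token id; optionally erase the methylation mark
        return [vocab["C" if (b == "m" and collapse_mods) else b] for b in seq]

    methylated = "ACmGT"   # one cytosine carries 5mC
    unmodified = "ACCGT"

    # Collapsed view: the two sequences become identical token streams.
    print(tokenize(methylated, VOCAB_COLLAPSED, True) == tokenize(unmodified, VOCAB_COLLAPSED, True))   # True
    # Modification-aware, single-base view: the difference survives tokenization.
    print(tokenize(methylated, VOCAB_AWARE, False) == tokenize(unmodified, VOCAB_AWARE, False))         # False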

There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.

We need way more people thinking about biomedical AI.

panabee commented on AI Market Clarity   blog.eladgil.com/p/ai-mar... · Posted by u/todsacerdoti
panabee · 5 months ago
The author is a respected voice in tech and a good proxy of investor mindset, but the LLM claims are wrong.

They are unsupported not only by recent research trends and general patterns in ML and computing, but also by emerging developments in China, which the post even mentions.

Nonetheless, the post is thoughtful and helpful for calibrating investor sentiment.

panabee commented on Biomni: A General-Purpose Biomedical AI Agent   github.com/snap-stanford/... · Posted by u/GavCo
govideo · 5 months ago
ML first, then Bio and Data. Of course, interconnectedness runs high (e.g., just read about ML for non-random missingness in med records), and data is the foundational bottleneck/need across the board.

Interesting anecdote about Stanford doctors annotating QA questions!

Each of your comments gets my mind going... I'm going to think about them more and may ping you on other channels, per your profile. Thanks!

panabee · 5 months ago
More like alarming anecdote. :) Google did a wonderful job relabeling MedQA, a core benchmark, but even they missed some (e.g., question 448 in the test set remains wrong according to Stanford doctors).

For ML, start with MedGemma. It's a great family. 4B is tiny and easy to experiment with. Pick an area and try finetuning.
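
If it helps, a minimal LoRA fine-tuning skeleton. The model ID, text-only usage, and target modules below are assumptions -- check the MedGemma model card; the multimodal variant may need a different loading class:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model_id = "google/medgemma-4b-it"              # assumed Hugging Face model ID
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()

    # One toy step; a real run needs a curated, expert-checked dataset.
    batch = tok(["Question: ... Answer: ..."], return_tensors="pt")
    model(**batch, labels=batch["input_ids"]).loss.backward()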

Note the new image encoder, MedSigLIP, which leverages another cool Google model, SigLIP. It's unclear if MedSigLIP is the right approach (open question!), but it's innovative and worth studying for newcomers. Follow Lucas Beyer, SigLIP's senior author and now at Meta. He'll drop tons of computer vision knowledge (and entertaining takes).
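
A quick way to poke at the SigLIP side (base SigLIP checkpoint shown; swapping in MedSigLIP weights is up to you, and the image file name below is hypothetical):

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    ckpt = "google/siglip-base-patch16-224"
    model = AutoModel.from_pretrained(ckpt)
    processor = AutoProcessor.from_pretrained(ckpt)

    image = Image.open("chest_xray.png")            # hypothetical local image
    texts = ["a chest x-ray", "a histopathology slide"]
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits_per_image
    print(torch.sigmoid(logits))                    # SigLIP scores with a sigmoid, not a softmax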

For bio, read 10 papers in a domain of passion (e.g., lung cancer). If you (or AI) can't find one biased/outdated assumption or method, I'll gift a $20 Starbucks gift card. (Ping on Twitter.) This matters because data is downstream of study design, and of course models are downstream of data.

Starbucks offer open to up to three people.

u/panabee

Karma: 4313 · Cake day: July 4, 2011
About
X.com/panabee

Founder, Hotpot.ai and HotpotBio (Hotpot.ai/bio)

Better data for lung cancer: https://hotpot.ai/non-smoking-lung-cancer (We don't benefit financially; we're only trying to plug holes in the system.)

Reevaluating Viral Etiology and Rational Diagnosis of ME/CFS and Low-grade Brain Parenchymal Inflammation: message for pre-publication access.

EBV + breast cancer: https://www.biorxiv.org/content/10.1101/2024.11.28.625954v2

Please message for free Hotpot credits. HN is an invaluable source of knowledge. I would be honored to give back.
