BioGeek (u/BioGeek) - Readit News

BioGeek commented on Semantic search engine for ArXiv, biorxiv and medrxiv arxivxplorer.com/... · Posted by u/0101111101

0101111101 · 7 months ago

Sadly I couldn't find a public API for chemrxiv, but would be happy to be proven wrong!

BioGeek · 7 months ago

Here it is:

https://chemrxiv.org/engage/chemrxiv/public-api/documentatio...

BioGeek commented on Nucleotide Transformer: building robust foundation models for human genomics nature.com/articles/s4159... · Posted by u/bookofjoe

bilsbie · a year ago

Cool! I don’t understand what the genomic tasks it solves are. What can it actually do?

Also can we train this same model on regular language data so we can converse about the genomes? I suppose a normal multimodal model can talk about what it sees in images in English. Could we have a similar thing with genomes? I.e., DNA is just another modality in a multimodal model.

BioGeek · a year ago

> Also can we train this same model on regular language data so we can converse about the genomes?

Yes! That is what has been done in ChatNT [1], where you can ask natural language questions like "Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5." and ChatNT will answer with "The degradation rate for this sequence is 1.83."

> My biggest point of confusion is what type of practical things these models can do.

See for example this notebook [2], where the Nucleotide Transformer is fine-tuned to classify genomic sequences into two of the most basic genomic motifs: promoters and enhancers.

Disclaimer: I work at InstaDeep but was not involved in either of the above projects.

[1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2

[2] https://github.com/huggingface/notebooks/blob/main/examples/...

BioGeek commented on For chemists, the AI revolution has yet to happen nature.com/articles/d4158... · Posted by u/bookofjoe

__MatrixMan__ · 3 years ago

I imagine a similar approach could be taken re: genomics / proteomics. Thousands of tiny bioreactors with slightly different genetics, temperature, nutrients, etc, all piped to chromatography equipment and optimizing for the metabolic pathway of some desirable. Maybe blast 'em with gamma and try to catch a lucky mutation, etc.

Edit: I'm not the only one imagining such a thing: https://www.sciencedirect.com/science/article/pii/S095816692...

BioGeek · 3 years ago

For a recently published example of this see [1]: an automated platform, called Self-driving Autonomous Machines for Protein Landscape Exploration (SAMPLE), can design and build proteins using AI agents and robotics. In an initial proof-of-concept, it was used to make glycoside hydrolase (sugar-cutting) enzymes that can withstand higher-than-normal temperatures.

The SAMPLE system used four different autonomous agents, each of which designed slightly different proteins. These agents search the fitness landscape for a protein and then proceed to test and refine it over 20 cycles. The entire process took just under six months. It took one hour to assemble genes for each protein, one hour to run PCR, three hours to express the proteins in a cell-free system, and three hours to measure each protein’s heat tolerance. That’s nine hours per data point! The agents had access to a microplate reader and Tecan automation system, and some work was also done at the Strateos Cloud Lab.

SAMPLE made sugar-cutting enzymes that could tolerate temperatures 10°C higher than even the best natural sequence, called Bgl3. The AI agents weren’t “told” to enhance catalytic efficiency, but their designs also had catalytic efficiencies that matched or exceeded Bgl3.

[1] https://www.biorxiv.org/content/10.1101/2023.05.20.541582v1

[2] https://www.readcodon.com/i/122504181/ai-agents-design-prote...

BioGeek commented on Self-Driving Cars with Duckietown: Learning Autonomy on the Jetson Nano duckietown.org/mooc... · Posted by u/ArtWomb

AndreaCensi · 5 years ago

I am Andrea Censi, one of the creators (https://censi.science). Feel free to AMA!

BioGeek · 5 years ago

I have a DonkeyCar [1] with a Jetson Nano 4GB. Is it possible to follow the course with that hardware platform?

[1] https://www.donkeycar.com/

BioGeek commented on Novel coronavirus complete genome from the Wuhan outbreak available in GenBank ncbiinsights.ncbi.nlm.nih... · Posted by u/mikhailfranco

Smerity · 6 years ago

Champion! Thank you =]

BioGeek · 6 years ago

There is a small error in the code. The variable `count` should be defined on line 11 like:

    count = int(record["Count"])

and the appearance on line 15 should be removed.

BioGeek commented on Anatomy of a Scam jacquesmattheij.com/anato... · Posted by u/cocoflunchy

BioGeek · 6 years ago

All those pages that you found of people having exactly that same combination of educational background can be explained simply. They are the default sample content of the Team page in certain WordPress themes (hence the WP in ‘Blockchain WP’). A quick Google search shows that, for example, the FinanceCo theme from Radius has the exact same Education listing.[0]

So the other profiles you found could just be the work of sloppy webmasters who didn't remove the default Team pages of their WordPress theme.

[0] https://www.radiustheme.com/demo/wordpress/themes/financeco/...

BioGeek commented on Novel coronavirus complete genome from the Wuhan outbreak available in GenBank ncbiinsights.ncbi.nlm.nih... · Posted by u/mikhailfranco

Smerity · 6 years ago

Does anyone know of an easy way to download the nucleotide sequences for the entire set of pneumonia / coronavirus viruses? I've looked at the FTP mirror but can't connect it with the nucleotide locus I find on the site itself.

I work in language modeling and want to see about using those unsupervised / self-supervised methods for genome annotation / phylogenetic tree construction.

Even if it's more of a curiosity than a useful tool, I have experience with small datasets, most recently focused on character level language modeling on ~90MB of training data, so if I can get (90MB / (29 kilobases * 2 bits per base) =) approximately 12,000 related samples I should be able to at least make a dataset out of it.
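
For reference, the estimate in parentheses can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 1 MB = 10^6 bytes and that every base is packed into exactly 2 bits with no headers or line breaks, and the variable names are made up for illustration.

```python
# Assumptions (not stated in the thread): 1 MB = 10**6 bytes, and each
# base (A, C, G, T) is packed into exactly 2 bits with no overhead.
TRAINING_BYTES = 90 * 10**6   # ~90 MB of training data
GENOME_BASES = 29_000         # ~29 kilobases per genome
BITS_PER_BASE = 2

bytes_per_genome = GENOME_BASES * BITS_PER_BASE / 8
genomes_needed = TRAINING_BYTES / bytes_per_genome

print(f"{bytes_per_genome:.0f} bytes per genome")
print(f"about {genomes_needed:,.0f} genomes")
```

Under these assumptions the result is a little over 12,000 genomes, in line with the figure quoted. If the sequences are instead kept as plain text at one byte per base, the same 90 MB would correspond to roughly a quarter as many genomes.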

BioGeek · 6 years ago

Using Biopython. Note that the search query that I am using currently returns 70605 results, so you might want to tweak it to fit your needs.

    from Bio import Entrez
    import time
    from urllib.error import HTTPError
    
    DB = 'nucleotide'
    QUERY = '("pneumoviridae"[Organism] OR "Coronaviridae"[Organism])'
    
    Entrez.email = 'your.email@provider.com'
    handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
    record = Entrez.read(handle)
    count = int(record["Count"])  # total number of records matching the query
    handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
    record = Entrez.read(handle)

    id_list = record['IdList']
    post_xml = Entrez.epost(DB, id=",".join(id_list))
    search_results = Entrez.read(post_xml)
    
    webenv = search_results['WebEnv']
    query_key = search_results['QueryKey']

    batch_size = 200
    with open('viruses.fasta', 'w') as out_handle:
        for start in range(0, count, batch_size):
            end = min(count, start+batch_size)
            print(f"Going to download record {start+1} to {end}")
            attempt = 0
            success = False
            while attempt < 3 and not success:
                attempt += 1
                try:
                    fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
                                                 retstart=start, retmax=batch_size,
                                                 webenv=webenv, query_key=query_key)
                    success = True
                except HTTPError as err:
                    if 500 <= err.code <= 599:
                        print(f"Received error from server {err}")
                        print(f"Attempt {attempt} of 3")
                        time.sleep(15)
                    else:
                        raise
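            if not success:
                # All three attempts failed, so there is no valid handle to read.
                raise RuntimeError(f"Giving up on records {start+1} to {end}")
            data = fetch_handle.read()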
            fetch_handle.close()
            out_handle.write(data)
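
Once the download has finished, a quick sanity check of the resulting file could look like the sketch below. It is not part of the script above: `read_fasta` is a hypothetical helper written with the standard library only, and the two records are made-up placeholders standing in for the real `viruses.fasta`.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, ''.join(chunks)

# Made-up example data in place of the real viruses.fasta
example = io.StringIO(
    ">seq1 made-up record\nACGT\nAC\n>seq2 made-up record\nGGG\n"
)
records = list(read_fasta(example))
lengths = [len(seq) for _, seq in records]
print(len(records), "records,", sum(lengths), "bases in total")
```

To run it on the real download, replace the `io.StringIO(...)` object with `open('viruses.fasta')`. Biopython's `SeqIO.parse(handle, 'fasta')` does the same job and is likely the better choice for anything beyond a quick check.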