""" In March 2024 Kimer Med announced it has signed a contract valued at up to USD$750,000 (NZD$1.3 million) with Battelle Memorial Institute (Battelle), the world’s largest independent, nonprofit research and development organization. The contract is focused on the discovery and development of new antiviral drug candidates for the treatment of alphaviruses. """
For the data - Orthrus is trained on non-experimentally collected data, so our pre-training dataset is large by biological standards. It adds up to about 45 million unique sequences, and assuming ~1k tokens per sequence, that's roughly 45B tokens.
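The back-of-envelope math behind that token count is simple enough to write down explicitly (the sequence count and tokens-per-sequence figures are the rough numbers from the comment, not measured values):

```python
# Back-of-envelope estimate of the pre-training corpus size.
# Numbers are the rough figures quoted above, not exact measurements.
num_sequences = 45_000_000        # ~45M unique sequences
avg_tokens_per_sequence = 1_000   # assumed average, ~1k tokens/sequence

total_tokens = num_sequences * avg_tokens_per_sequence
print(f"~{total_tokens / 1e9:.0f}B tokens")  # prints "~45B tokens"
```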
We're thinking about this as a large pre-training run on a bunch of annotation data from RefSeq and GENCODE, in conjunction with more specialized orthology datasets that pool data across hundreds of species.
Then for specific applications we're fine-tuning or doing linear probing for experimental prediction. For example, we can predict mRNA half-life using publicly available data collected by this awesome paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13...
Or translation efficiency: https://pubmed.ncbi.nlm.nih.gov/39149337/
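For anyone unfamiliar with linear probing: you freeze the pre-trained encoder, embed each transcript, and fit a simple linear model from embeddings to the measured property. Here's a minimal NumPy sketch with synthetic stand-in data (the `embeddings` and `half_life` arrays are invented for illustration; in practice the embeddings would come from a frozen encoder and the labels from datasets like those linked above):

```python
import numpy as np

# Linear probing sketch: ridge regression from frozen transcript
# embeddings to a measured property (e.g. mRNA half-life).
# All data below is synthetic, just to show the mechanics.
rng = np.random.default_rng(0)
n, d = 200, 64                      # 200 transcripts, 64-dim embeddings
embeddings = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
half_life = embeddings @ true_w + 0.1 * rng.normal(size=n)

# Closed-form ridge solution: (X^T X + lam*I)^-1 X^T y
lam = 1.0
w = np.linalg.solve(embeddings.T @ embeddings + lam * np.eye(d),
                    embeddings.T @ half_life)

pred = embeddings @ w
r2 = 1 - np.sum((half_life - pred) ** 2) / np.sum((half_life - half_life.mean()) ** 2)
print(f"in-sample R^2: {r2:.3f}")
```

The appeal is that the probe is cheap to fit and can't overfit much, so probe performance is a reasonably honest read on how much of the experimental signal the embeddings already carry.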
Eventually, as we ramp up our wet-lab data generation, we're thinking about what post-training looks like. There is an RL analog here that we could apply on top of these generalizable embeddings to reward "high-quality samples".
There are some early attempts at post-training in bio, and I think it's a really exciting direction.
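One concrete version of that RL analog (my sketch, not anyone's actual pipeline) is a learned reward head used for best-of-N selection: fit a simple reward model on embeddings of designs with measured assay outcomes, then use it to triage new candidates before spending wet-lab budget. All names and data here are invented for illustration:

```python
import numpy as np

# Hypothetical "RL analog" on frozen embeddings: fit a linear reward
# head on assayed designs, then rank new candidates (best-of-N).
rng = np.random.default_rng(1)
n, d = 100, 32
emb = rng.normal(size=(n, d))        # embeddings of already-assayed designs
quality = emb @ rng.normal(size=d)   # stand-in for measured assay quality

# Least-squares reward head: reward(x) = x @ w
w, *_ = np.linalg.lstsq(emb, quality, rcond=None)

candidates = rng.normal(size=(1000, d))  # embeddings of new designs
scores = candidates @ w
best = np.argsort(scores)[-10:][::-1]    # top-10 by predicted reward
print("top candidate indices:", best)
```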
I'm curious about what your strategy is for data collection to fuel improved algorithmic design. Are you building out experimental capacity to generate datasets in house, or is that largely farmed out to partners?
There is also one widespread approach that isn't mentioned in the article: expansion microscopy. Expansion takes the sci-fi-sounding approach of asking: what if you could make your sample physically bigger? See the Wikipedia page for more: https://en.wikipedia.org/wiki/Expansion_microscopy
For instance, Evo 2 by the Arc Institute is a DNA foundation model that can do some really remarkable things to understand, interpret, and design DNA sequences, and there are now multiple open-weight models for working with biomolecules at a structural level that are comparable to AlphaFold 3.
What complicates things is that the experimental data we get back from labs to validate MD behavior is extremely tricky to work with. Most of what we're working with is NMR data, which shows flexibility in regions of the proteins, but even then we're left with mathematical models to try to "make sense" of that flexibility and infer dynamics from it. Sometimes it feels like both an art and a science, trying to extract meaningful insights from lab data like this.
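One common quantitative bridge between the two worlds is per-residue RMSF (root-mean-square fluctuation) computed from the MD trajectory, which can be compared qualitatively against the regions NMR flags as flexible (e.g. via low S^2 order parameters). A toy sketch with a synthetic trajectory (a real one would come from tools like MDAnalysis or mdtraj, not shown here):

```python
import numpy as np

# Per-residue RMSF from an MD trajectory as a flexibility proxy.
# `traj` is a toy (frames, residues, xyz) array; the scale ramp fakes
# a protein whose later residues (e.g. a loop) move more.
rng = np.random.default_rng(2)
n_frames, n_res = 500, 20
mobility = np.linspace(0.1, 1.0, n_res)
traj = rng.normal(scale=mobility[None, :, None], size=(n_frames, n_res, 3))

mean_pos = traj.mean(axis=0)  # average structure
# RMSF_i = sqrt( <|r_i(t) - <r_i>|^2> ), averaged over frames
rmsf = np.sqrt(((traj - mean_pos) ** 2).sum(axis=2).mean(axis=0))

print("most flexible residue:", int(np.argmax(rmsf)))
```

High-RMSF residues should line up, at least qualitatively, with the flexible regions NMR reports; where they disagree is exactly where the "art" part of reconciling the two begins.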
It's extremely difficult to experimentally verify any MD model since, as mentioned in the article, most of the data we're working with are static mugshots in the form of crystal structures.
I'm curious if you've worked with any of those models and how they relate to NMR data and MD simulations.
And an Editorial piece (more technical than the NYT): https://www.nejm.org/doi/full/10.1056/NEJMe2505721