We are Michael & Jono, and we are building Cerebrium (https://www.cerebrium.ai), a serverless infrastructure platform for ML/AI applications - we make it easy for engineers to build, deploy and scale AI applications.
Initially, we’ve been hyper-focused on the inference side of applications, but we’re working on expanding our functionality to support training and data processing use cases—eventually covering the full AI development lifecycle.
You can watch a quick loom video of us deploying: https://www.loom.com/share/06947794b3bf4bb1bb21c87066dfcc66?...
How we got here:
Jono and I led the technical team at our previous e-commerce startup, which grew rapidly over a few years. As we scaled, we were tasked with building out ML applications to make the business more efficient. It was tough: every day felt like a defeat. We found ourselves stitching together AWS Lambda, SageMaker, and Prefect jobs (this stack alone was enough to make me want to give up). By the time we reached production, the costs were too high to maintain. Getting these applications live required a significant upfront investment of both time and money, putting them out of reach for most startups and scale-ups. We wanted to create something that would help us (and others like us) implement ML/AI applications easily and cost-effectively.
The problem:
There are a ton of challenges to tackle to realize our vision, but we’ve initially focused on a few key ones:
1. GPUs are expensive – An A100 is 326 times the cost of a CPU, and companies are using LLMs like they’re simple APIs. Serverless instances solve this to an extent, but minimizing cold starts is difficult.
2. Local development – Engineers need local development environments to iterate quickly, but production-grade GPUs aren’t available on consumer hardware. How can we make cloud deployments feel as fast as just saving a file locally and retrying?
3. Cost to experiment – To run experiments we had to spin up EC2 instances each day, recreate our environment, and run scripts. It was difficult to monitor logs and instance usage metrics, or to run large processing jobs and scale endpoints, without a significant infrastructure investment. On top of that, we often forgot to switch off instances, which cost us money!
Our Approach
We have three core areas that we are focused on which we believe are the most important for any infrastructure platform:
1. Performance:
We have worked hard to get our added network latency under 50ms and the cold starts of our average workloads down to 2-4 seconds. Here are a few things we did to get our cold starts so low:
- Container Runtime: We built our own container runtime that splits container images into two parts—metadata and data blobs. Metadata provides the file structure, while the actual data blobs are fetched on-demand. This allows containers to start before the full image is downloaded. In the background, we prefetch the remaining blobs.
- Caching: Once an image is on a machine, it’s cached for future use. This makes subsequent container startups much faster. We also intelligently route requests to machines where the image is already cached.
- Efficient Inference: We route requests to the optimal machines, prioritizing low-latency and high-throughput performance. If no containers are immediately available, we efficiently queue the requests through our task scheduling system.
- Distributed Storage Cache: One of the most resource-intensive parts of AI workloads is loading models into VRAM. We use NVMe drives (which are much faster than network volumes) as close as possible to the machines, and we orchestrate workloads onto nodes that already hold the necessary model weights where possible.
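The lazy image-loading idea above can be sketched in a few lines. This is a simplified, hypothetical Python model (the names `LazyImage`, `metadata`, and `blob_store` are illustrative, not Cerebrium's actual runtime): the file index is fetched up front so the container can start, blob contents are pulled on demand and cached, and a background thread prefetches the rest.

```python
import threading

class LazyImage:
    """Toy model of a lazily loaded container image: the metadata
    (file index) is available at startup, while blob contents are
    fetched on demand, cached, and prefetched in the background."""

    def __init__(self, metadata, blob_store):
        self.metadata = metadata      # path -> blob id (known at start)
        self.blob_store = blob_store  # blob id -> bytes ("remote" store)
        self.cache = {}               # blob id -> bytes (local cache)
        self.lock = threading.Lock()

    def read(self, path):
        """Serve a file read; fetch its blob only if not yet cached."""
        blob_id = self.metadata[path]
        with self.lock:
            if blob_id not in self.cache:
                self.cache[blob_id] = self.blob_store[blob_id]
            return self.cache[blob_id]

    def prefetch_all(self):
        """After startup, pull the remaining blobs in the background."""
        t = threading.Thread(
            target=lambda: [self.read(p) for p in self.metadata])
        t.start()
        return t
```

The point of the split is that `read` can serve the container's first few file accesses immediately, while `prefetch_all` warms the cache for everything else.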
2. Developer Experience
We built Cerebrium to help developers iterate as quickly as possible by streamlining the entire build and deployment process.
To get build times as low as possible, we use high-performance machines and cache layers where possible. We've reduced first-time build times to an average of 2 minutes and 24 seconds, with subsequent builds completing in just 19 seconds.
We also offer a wide range of GPU types—over 8 different options—so you can easily test performance and cost efficiency by adjusting a single line in your configuration file.
To reduce friction, we’ve kept things simple. There are no custom Python decorators and no Cerebrium-specific syntax to learn. You just add a .toml file to define your hardware requirements and environment settings. This makes migrating onto our platform just as easy as migrating off it. We aim to impress you enough that you will want to stay.
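As an illustration, such a config might look something like this (the section and field names here are hypothetical, based on the description above; check the Cerebrium docs for the actual schema):

```toml
# Illustrative only - see the Cerebrium docs for the real schema.
[deployment]
name = "my-model"
python_version = "3.11"

[hardware]
gpu = "A100"     # swap GPU type by editing this one line
cpu = 2
memory = 12.0    # GB
```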
3. Stability
This is arguably more important than the first two areas - no one wants an email at 11pm or on a Saturday saying their application is down or degraded. Since April, we’ve maintained 99.999% uptime. We have redundancies in place, monitoring, alerts, and a team covering all time zones to resolve any issues quickly.
Why Is This Hard?
Building Cerebrium has been challenging because it involves solving multiple interconnected problems. It requires optimization at every step—from efficiently splitting images to fetching data on-demand without introducing latency, handling distributed caching, optimizing our network stack, and ensuring redundancies, all while holding true to the three areas mentioned above.
Pricing:
We charge you exactly for the resources you need, and only when your code is running, i.e. usage-based. For example, if you specify 1 A100 GPU with 2 CPUs and 12GB of RAM, we charge you exactly for that rather than for the full A100 host (12 CPUs and 148GB of memory).
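To make the usage-based model concrete, here is a toy cost calculation. The per-second rates below are made up for illustration (see the pricing page for real numbers); the point is that you pay for the resources you requested, only while your code runs.

```python
def usage_cost(seconds, gpu_rate_s, cpus, cpu_rate_s, ram_gb, ram_rate_s):
    """Bill only the requested resources, only while code is running."""
    per_second = gpu_rate_s + cpus * cpu_rate_s + ram_gb * ram_rate_s
    return seconds * per_second

# Hypothetical rates: 1 A100 + 2 CPUs + 12 GB RAM for 60 seconds.
# You are not billed for the rest of the host the GPU lives on.
cost = usage_cost(seconds=60, gpu_rate_s=0.0007,
                  cpus=2, cpu_rate_s=0.00001,
                  ram_gb=12, ram_rate_s=0.000002)
```

With these made-up rates, a 60-second run costs well under a cent; idle time costs nothing because `seconds` only counts time your code is actually running.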
You can see more about our pricing here: http://www.cerebrium.ai/pricing
What’s Next?
We're builders too, and we know how crucial support can be when you're working on something new. Here's what we've put together to support teams like yours:
- $30 in free credit to start exploring. If you're onto something interesting but need more runway, just give us a shout - we’d be happy to extend that for compelling use-cases.
- We have worked hard on our docs to make onboarding easy, and we maintain an extensive GitHub repo of examples covering AI voice agents, LLM optimizations, and much more.
Docs: https://docs.cerebrium.ai/cerebrium/getting-started/introduc...
Github Examples: https://github.com/CerebriumAI/examples/tree/master
If you have a question or hit a snag, you can directly reach out to the engineers who built the platform—we’re here to help! We’ve also set up Slack and Discord communities where you can connect with other creators, share experiences, ask for advice, or just chat with folks building cool things.
We're looking forward to seeing what you all build - please give us feedback on what you would like us to improve or add!
In terms of cold starts, we seem to be very comparable, based on what users have mentioned and tests we have run.
Easier config/setup is feedback we have gotten from users: since we don't have any special syntax or a "Cerebrium way" of doing things, migration is pretty easy and there's no lock-in, which some engineers appreciate. We just run your Python code as is, with an extra .toml setup file.
Additionally, we offer AWS Inferentia/Trainium nodes, which offer a great price/performance trade-off for many open-source LLMs - even compared with TensorRT/vLLM on NVIDIA GPUs - and sidestep the GPU scarcity problem. We plan to support TPUs and others in future.
We are listed on AWS Marketplace (as well as others), which means you can subtract your Cerebrium cost from your committed cloud spend.
Two things we are working on that will hopefully make us a bit different:
- GPU checkpointing
- Running compute in your own cluster, to use credits or for privacy concerns
Where Modal really shines is training/data-processing use cases, which we currently don't support too well. However, we do have this on our roadmap for the near future.
However, some of the situations where you would want to use Cerebrium over SkyPilot:
- You don't want to manage your own hardware
- Reduced costs: a serverless runtime with low cold starts (unclear if SkyPilot offers this, and what the performance is like if they do)
- Rapid iteration: unclear what the deployment process on SkyPilot looks like and how long projects take to go live
- Observability: it looks like you would just have k8s metrics at your disposal
- clearer messaging
- more tutorials
- one-click deploys
- clear & upfront costing
We have plans to add other runtimes (like Typescript) in the future but Python is our focus for now.
Related to that, it seems the syntax isn't documented https://docs.cerebrium.ai/cerebrium/environments/config-file...
You can see an example config file at the bottom of that link you attached - agreed we should probably make it more obvious
As for the quoting part, it's mysterious to me why a structured file would use a quoted string for what is obviously an interior structure. Imagine if you opened a file and saw
wouldn't you strongly suspect that there was some interior syntax going on there? Versus the sane encoding of:
in a normal markup language, no "inner/outer quoting" nonsense required. But I did preface it with my toml n00b-ness, and I know that the toml folks believe they can do no wrong, so maybe that's on purpose, I dunno.
The support is next level - the team is ready to dive into any problem, responses are super fast, and they've helped us solve a bunch of dev problems that a normal platform probably wouldn’t.
Really excited to see this one grow!!
We're definitely looking for something like this as we're looking to transition off Azure's (expensive) GPUs. I'm curious how you stack up against something like Runpod's serverless offering (which seems quite a bit cheaper). Do you offer faster cold starts? How long would a ~30GB model load take?
In terms of cold starts, they mention theirs are 250ms, though I'm not sure what workload that is on, or whether we use the same measure of cold starts. We have had quite a few customers tell us we are quite a bit faster (2-4 seconds vs ~10 seconds), although we haven't confirmed this ourselves.
For a 30GB model, we have a few ways to speed this up, such as using the Tensorizer framework from CoreWeave, and we cache model files in our distributed caching layer (we see reads of up to 1GB/s), but I would need to test. If you tell me the model you are running (if open-source) I can get results to you - you can message me on our Slack/Discord community or email me at michael@cerebrium.ai.
I may be misunderstanding your explanation a bit here, but Runpod's serverless "flex" tier looks like the same model (it only charges you for non-idle resources). And at that tier they are still 2x cheaper for A100, at your price point with them you could rent an H100.
When you ran it the first time, it took a while to load up. Do subsequent runs go faster?
And what cloud provider are you all using under the hood? We work in a specific sector that excludes us from using certain cloud providers (ie. AWS) at my company.
We are running on top of AWS however can run on top of any cloud provider as well as are working on you using your own cloud. Happy to hear more about your use case and see if we can help you at all - email me at michael@cerebrium.ai.
PS: I will state that vLLM has shocking load times into VRAM that we are resolving.
I just shared this on Slack and it looks like the site description has a typo: "A serverless AI infrastructure platform [...] customers experience a 40%+ cost savings as opposed to AWS of GCP"