I'm looking at using OCI at $DAY_JOB for model distribution across fleets of machines as well, so it's good to see it getting some traction elsewhere.
OCI has some benefits over other systems, namely that tiered caching/pull-through is already pretty battle-tested, as is signing etc., so it beats more naive distribution methods on reliability, performance and trust.
If combined with eStargz or zstd:chunked it's also pretty nice for distributed systems, as long as you can slice things up into files in such a way that not every machine needs to pull the full model weights.
Failing that, there are P2P distribution mechanisms for OCI (Dragonfly, etc.) that can lessen the burden without resorting to DIY on BitTorrent or similar.
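For anyone who hasn't poked at the registry side: it's all content-addressed blobs behind a small HTTP API, which is why pull-through caching and signing fall out almost for free. A minimal sketch of resolving a model artifact's manifest via the OCI distribution API (registry/repo names are placeholders, auth omitted):

    # Minimal sketch: resolve an OCI manifest and list its content-addressed
    # layers. Registry/repo names are placeholders; auth is omitted (real
    # registries usually need a bearer token).
    import requests

    REGISTRY = "registry.example.com"   # placeholder
    REPO = "models/llama-3-8b"          # placeholder
    TAG = "v1"

    resp = requests.get(
        f"https://{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": "application/vnd.oci.image.manifest.v1+json"},
    )
    resp.raise_for_status()
    manifest = resp.json()

    # Each layer is an immutable, digest-addressed blob, which is what makes
    # pull-through caches and signature verification straightforward.
    for layer in manifest.get("layers", []):
        print(layer["mediaType"], layer["digest"], layer["size"])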
That is exactly the feature we are using. Right now you need to be on a beta release of containerd, but before long it should be pretty widespread.
In combination with lazy pull (eStargz) it's a pretty compelling implementation.
Damn, that's handy. I now wonder how much trouble it would be to make a CSI driver that does this, for backporting to 1.2x clusters (since I don't think Kubernetes backports anything).
I've been pretty disappointed with eStargz performance, though... Do you have any numbers you can share? All over the internet people refer to numbers from 10 years ago, from workloads that don't seem realistic at all. In my experiments it didn't provide a significant enough speedup.
In our case some machines would need to access less than 1% of the image size, but being able to have an image with the entire model weights as a single artifact is an important feature in and of itself. In our specific scenario, even if eStargz is slow by filesystem standards, it's competing with network transfer anyway, so if it's within the same order of magnitude as rsync, that will do.
I don't have any perf numbers I can share, but I can say we see ~30% compression with eStargz, which is already a small win at least, heh.
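For the "less than 1% of the image" case: if the weights layer is stored uncompressed (or in a seekable format like eStargz/zstd:chunked), you can in principle grab just a byte range of the blob. Rough sketch, noting that Range support on blob GETs varies by registry:

    # Sketch of a partial read of a layer blob via an HTTP Range request.
    # Caveats: Range support on blob GETs is registry-dependent, and partial
    # reads only help if the layer is uncompressed or in a seekable format
    # like eStargz / zstd:chunked. Names below are placeholders.
    import requests

    REGISTRY = "registry.example.com"   # placeholder
    REPO = "models/llama-3-8b"          # placeholder
    DIGEST = "sha256:..."               # a digest taken from the manifest

    resp = requests.get(
        f"https://{REGISTRY}/v2/{REPO}/blobs/{DIGEST}",
        headers={"Range": "bytes=0-1048575"},  # first 1 MiB of the blob
    )
    print(resp.status_code)  # 206 Partial Content if ranges are honored
    chunk = resp.content     # only this slice crosses the network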
Be aware of licensing restrictions. Docker Desktop is free for personal use, but it requires a paid license if you work for an organization with more than 250 employees (or more than $10M in annual revenue). This feature seems to be available in Docker Desktop only.
Note: I'm part of the team developing this feature.
Soon (end of May, according to the current roadmap) this feature will also be available with the Docker Engine (so not only as part of Docker Desktop).
As a reminder, Docker Engine is the Community Edition, Open Source and free for everyone.
My understanding has always been that Docker Engine was only available directly on Linux. If you are running another operating system then you will need to run Docker Desktop (which, in turn, runs a Docker Engine instance in a VM).
This comment kind of makes it sound like maybe you can run Docker Engine directly on these operating systems (macOS, Windows, etc.); is that the case?
I don't understand why you'd add another domain-specific command to a container manager and go out of scope for what the tool was designed for in the first place.
The main benefit I see for cloud platforms: caching/co-hosting various services based on model instead of (model + user's API layer on top).
For the end user, it would be one less deployment headache to worry about: not having to package ollama + the model into docker containers for deployment. Also a more standardized deployment for hardware accelerated models across platforms.
(disclaimer: I'm leading the Docker Model Runner team at Docker)
It's fine to disagree of course, but we envision Docker as a tool that has a higher abstraction level than just container management. That's why having a new domain-specific command (that also uses domain-specific technology that is independent from containers, at least on some platform targets) is a cohesive design choice from our perspective.
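To make the "one less deployment headache" point concrete: the application side shrinks to calling an OpenAI-compatible endpoint instead of shipping its own inference container. Rough sketch, where the base URL and model name are assumptions about a typical setup rather than documented values:

    # Sketch of the "no ollama container to package" workflow: the app just
    # talks to an OpenAI-compatible chat endpoint exposed by the model runner.
    # The base URL and model name are assumptions/placeholders; check your
    # own setup for the actual values.
    import requests

    BASE_URL = "http://localhost:12434/engines/v1"   # assumed endpoint
    MODEL = "ai/llama3.2"                            # example model reference

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Say hello in one word."}],
        },
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])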
They are using OCI artifacts to package models, so you can use your own registry to host these models internally. However, I just can't see any improvement compared with a simple FTP server. I don't think LLM models can adopt hierarchical structures the way Docker images do, so they can't leverage the benefits of layered filesystems, such as caching and reuse.
It's not the only one using OCI to package models. There's a CNCF project called KitOps (https://kitops.org) that has been around for quite a bit longer. It solves some of the limitations of using Docker, one of those being that you don't have to pull the entire project when you want to work on it; instead, you can pull just the data set, tuning, model, etc.
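The "pull just the piece you need" idea boils down to filtering the artifact's layers by media type and fetching only that blob. Hypothetical sketch (the media type string is illustrative, not KitOps' actual value):

    # Sketch of "pull just one component": filter the artifact's layers by
    # media type and download only that blob. The media type string is purely
    # illustrative; registry/repo names are placeholders.
    import requests

    REGISTRY = "registry.example.com"               # placeholder
    REPO = "team/project"                           # placeholder
    TAG = "v1"
    WANTED = "application/x-example.model.weights"  # illustrative media type

    manifest = requests.get(
        f"https://{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": "application/vnd.oci.image.manifest.v1+json"},
    ).json()

    for layer in manifest["layers"]:
        if layer["mediaType"] == WANTED:
            blob = requests.get(
                f"https://{REGISTRY}/v2/{REPO}/blobs/{layer['digest']}",
                stream=True,
            )
            with open("weights.bin", "wb") as f:
                for chunk in blob.iter_content(chunk_size=1 << 20):
                    f.write(chunk)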
They imply it should be somehow optimized for Apple silicon, but, yeah, I don't understand what this is. If Docker can use the GPU, well, it should be able to use the GPU in any container that makes proper use of it. If (say) ollama as an app doesn't use it properly, but they figured out a way to do it better, it would make more sense to fix ollama. I have no idea why this should be a different app rather than, well, the Docker daemon itself.
All that work (AGX acceleration...) is done in llama.cpp, not ollama. Ollama's raison d'être is a docker-style frontend to llama.cpp, so it makes sense that Docker would encroach from that angle.
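That GPU work shows up directly in llama.cpp's own knobs, e.g. via its Python bindings. A rough sketch, assuming a Metal-enabled build on Apple silicon and a local GGUF file (the model path is a placeholder):

    # Sketch: the GPU offload lives in llama.cpp itself, shown here via the
    # llama-cpp-python bindings. Assumes a Metal-enabled build on Apple
    # silicon; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3-8b.Q4_K_M.gguf",  # placeholder GGUF file
        n_gpu_layers=-1,                        # offload every layer to the GPU
    )

    out = llm("Q: Name one planet.\nA:", max_tokens=16)
    print(out["choices"][0]["text"])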
Can’t say I'm a fan of packaging models as docker images. Feels forced - a solution in search of a problem.
The existing stack - a server and a model file - works just fine. There doesn't seem to be a need to jam an abstraction layer in there. The core problem Docker solves just isn't there.
(disclaimer: I'm leading the Docker Model Runner team at Docker)
We are not packaging models as Docker images, since that is indeed the wrong fit and comes with all kinds of technical problems. It also feels wrong to package pure data (which models are) into an image, which is generally expected to be a runnable artifact.
That's why we decided to use OCI Artifacts and to specify our own OCI Artifact subset that is better suited for the use case. The spec and implementation are OSS; you can check them out here: https://github.com/docker/model-spec
Is this really a Docker feature, though? llama.cpp provides acceleration on Apple hardware; I guess you could create a Docker image with llama.cpp and an LLM model and have mostly this feature.
I'm going to take a contrarian perspective to the theme of comments here...
There are currently very good uses for this, and there are likely to be more. There are increasing numbers of large generative AI models used in technical design work (e.g., semiconductor rules-based design/validation, EUV mask design, design optimization). Many/most don't need to run all the time. Some have licensing that is based on length of time running, credits, etc. Some are just huge and intensive, but not run very often in the design flow. Many are run on the cloud, but industrial customers are reluctant to run them on someone else's cloud.
Being able to have my GPU cluster/data center run a ton of different and smaller models during the day or early in the design, and then be turned over to a full CFD or validation run as your office staff goes home, seems to me to be useful. Especially if you are in any way getting billed by your vendor based on run time or similar. It can mean a more flexible hardware investment. The use case here is going to be Formula 1 teams, silicon vendors, etc. - not pure tech companies.
I wonder if the adult kids of some Docker execs own Macs, and that's why they made this. Why on Earth is this not for the larger installed OSes, you know, the ones running Docker in production?
(disclaimer: I'm leading the Docker Model Runner team at Docker)
We decided to start with Apple silicon Macs, because they provide one of the worst experiences of running LLMs in a containerized form, while at the same time having very capable hardware, so it felt like a very sad situation for Mac users (because of the lack of GPU access within containers).
And of course we understand who our users are, so believe me when I say that macOS users on Apple silicon make up a significant portion of our user base; otherwise we would not have started with it.
In production environments on Docker CE, you can already mount the GPUs, so while the UX is not great, it is not a blocker.
However, we have first class support for Docker Model Runner within Docker CE on our roadmap and we hope it comes sooner rather than later ;)
It will also be purely OSS, so no worries there.
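On the "you can already mount the GPUs" point: on Docker CE that is the equivalent of "docker run --gpus all", which through the Python SDK looks roughly like this (assumes the NVIDIA Container Toolkit is installed on the host; the image tag is just an example):

    # Rough sketch of GPU access on Docker CE via the Python SDK, equivalent
    # to `docker run --gpus all`. Assumes the NVIDIA Container Toolkit is set
    # up on the host; the image tag is illustrative.
    import docker

    client = docker.from_env()
    logs = client.containers.run(
        "nvidia/cuda:12.4.1-base-ubuntu22.04",   # example CUDA base image
        "nvidia-smi",
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]),
        ],
        remove=True,
    )
    print(logs.decode())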
(I ended up developing an alternative pull mechanism, which is described in https://outerbounds.com/blog/faster-cloud-compute though note that the article is a bit light on the technical details)
Seems fair to raise 1bn at a valuation of 100bn. (Might roll the funds over into pitching Kubernetes, but with AI, next month.)
There is at least one benefit. I'd be interested to see what their security model is.