Ollama started out as the clean "just run a model locally" story, but now it’s branching into paid web search, which muddies the open-source appeal. Windows ML goes the other direction: deep OS integration, but at the cost of tying your stack tightly to the Windows ecosystem, very reminiscent of DirectX.
Now, the real question is whether vLLM/ONNX or just running straight on CUDA/ROCm are the only alternatives, or whether we are all just trading one vendor lock-in for another.
It's only a matter of time before the LLMs start having paid product placement in their results. Even open-source - the money needed to train & operate these models is just too enormous.
Could there be a distributed training system run by contributors, like the various @Home projects? Yeah, decent chance of that working, especially with the widespread availability of fiber connections. But then to query the model you still need a large low-latency system (i.e. not distributed) to host it, and that's expensive.
I have extensive experience building hardware-accelerated AI inference pipelines on Windows, including on the now-retired DirectML (not a great loss). One thing I learned is that the hardware vendor support promised in press releases is often an outright lie, and that reporting any kind of bug or missing functionality only results in an infinite blame-game loop between Microsoft developer support and the hardware vendors. It appears that Windows is no longer a platform that anyone feels responsible for or takes pride in maintaining, so your best hope is to build a web app on top of Chromium, so that at least you will have Google on your side when something inevitably breaks.
System ONNX might be quite nice for Windows applications, provided the backends are actually reliable on most systems. AMD currently has three options for example (ROCm, MIGraphX, and Vitis), and I've never gotten any of them to work. (Although MIGraphX looks to be no longer marked as experimental so maybe I should give it another try.)
How does Windows ML compare to just using something like Ollama plus an LLM that you download to your device (which seems like it would be much simpler)? What are the privacy implications of using Windows ML with respect to how much of your data it is sending back to Microsoft?
It’s kind of a bummer because this is the exact same playbook as DirectX, which ended up being a giant headache for the games industry, and now everyone is falling for it again.
It is the evolution of DirectX for ML, previously known as DirectML. It is not LLM-specific, and a large swathe of it isn't that Microsoft-specific either.
And it is a developer feature hidden from end users. E.g., in your Ollama example, does the developer ask end users to install Ollama? Does the dev redistribute Ollama and keep it updated?
The ONNX format is pretty much a boring de-facto standard for ML model exchange. It is under the Linux Foundation.
The ONNX Runtime is a Microsoft thing, but it is an MIT-licensed runtime for cross-language use and cross-OS/hardware-platform deployment of ML models in the ONNX format.
That bit needs to support everything because Microsoft itself ships software on everything (macOS/Linux/iOS/Android/Windows).
ORT — https://onnxruntime.ai
Here is the Windows ML part of this — https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/...
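To make "model exchange" concrete, a minimal sketch (assuming PyTorch and the CPU build of onnxruntime are installed; the toy model and file name are mine):

  # Sketch of ONNX as an exchange format: export from one framework, run it
  # anywhere ONNX Runtime has a build. The tiny model and file name are made up.
  import torch
  import onnxruntime as ort

  model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.ReLU()).eval()
  dummy = torch.randn(1, 16)

  # Producer side: serialize to the vendor-neutral ONNX format.
  torch.onnx.export(model, dummy, "tiny.onnx", input_names=["x"], output_names=["y"])

  # Consumer side: any ORT build (Windows/Linux/macOS/mobile) can load and run it.
  sess = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])
  print(sess.run(None, {"x": dummy.numpy()})[0].shape)  # (1, 4)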
The primary value claims for Windows ML (for a developer using it)—
This eliminates the need to:
• Bundle execution providers for specific hardware vendors
• Create separate app builds for different execution providers
• Handle execution provider updates manually.
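For a sense of what those bullets mean in practice, here is roughly the manual execution-provider handling in plain ONNX Runtime (Python) that Windows ML claims to absorb; the model path and the assumption that the right onnxruntime wheel was bundled per vendor are mine:

  # Rough sketch of the per-vendor juggling Windows ML says it takes over.
  # Assumes a local "model.onnx" and that the matching onnxruntime package
  # (onnxruntime-gpu, onnxruntime-directml, ...) was bundled with the app.
  import onnxruntime as ort

  # Preference order the app would otherwise have to maintain and ship itself.
  PREFERRED = [
      "TensorrtExecutionProvider",  # NVIDIA, needs the TensorRT-enabled build
      "CUDAExecutionProvider",      # NVIDIA, needs onnxruntime-gpu
      "DmlExecutionProvider",       # DirectML, needs onnxruntime-directml
      "CPUExecutionProvider",       # always present, final fallback
  ]

  available = set(ort.get_available_providers())
  providers = [ep for ep in PREFERRED if ep in available]

  session = ort.InferenceSession("model.onnx", providers=providers)
  print("Running on:", session.get_providers())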
Since ‘EP’ is ultra-super-techno-jargon:
Here is what GPT-5 provides:
Intensional (what an EP is)
In ONNX Runtime, an Execution Provider (EP) is a pluggable backend that advertises which ops/kernels it can run and supplies the optimized implementations, memory allocators, and (optionally) graph rewrites for a specific target (CPU, CUDA/TensorRT, Core ML, OpenVINO, etc.). ONNX Runtime then partitions your model graph and assigns each partition to the highest-priority EP that claims it; anything unsupported falls back (by default) to the CPU EP.
Extensional (how you use them)
• You pick/priority-order EPs per session; ORT maps graph pieces accordingly and falls back as needed.
• Each EP has its own options (e.g., TensorRT workspace size, OpenVINO device string, QNN context cache).
• Common EPs: CPU, CUDA, TensorRT (NVIDIA), DirectML (Windows), Core ML (Apple), NNAPI (Android), OpenVINO (Intel), ROCm (AMD), QNN (Qualcomm).
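In code, that looks roughly like this with the ONNX Runtime Python API (a sketch, not a drop-in recipe; the model path, option values, and the availability of TensorRT/CUDA builds are assumptions):

  # Sketch: priority-ordering EPs per session with per-EP options.
  # Option names follow the ORT documentation for the TensorRT and CUDA EPs.
  import onnxruntime as ort

  providers = [
      ("TensorrtExecutionProvider", {           # tried first for any subgraph it claims
          "trt_max_workspace_size": 2 << 30,    # 2 GiB of scratch space
          "trt_fp16_enable": True,
      }),
      ("CUDAExecutionProvider", {"device_id": 0}),  # next in line
      "CPUExecutionProvider",                       # default fallback for unsupported ops
  ]

  session = ort.InferenceSession("model.onnx", providers=providers)
  # ORT partitions the graph and assigns each partition to the highest-priority
  # EP that claims it; whatever is left over runs on the CPU EP.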
It's an optimized backend for running LLMs, much like CoreML on macOS, which has been received very positively due to the acceleration it enables (ollama/llama.cpp use it).
Since this uses ONNX you probably won't be able to use ollama directly with it, but conceptually you could use an app like it to run your models in a more optimized way.
Funny how the only mention of privacy in the post is this -
> This ability to run models locally enables developers to build AI experiences that are more responsive, private and cost-effective, reaching users across the broadest range of Windows hardware.
Correct me if I'm wrong, but if the LF AI & Data Foundation (Linux Foundation) ONNX working groups support advanced quantization (down to 4-bit grouped schemes, like GGUF's Q4/Q5 formats), standardize flash attention and similar fused ops, and allow efficient memory-mapped weights through the spec and into ONNX Runtime, then Windows ML and Apple Core ML could become a credible replacement for GGUF in local-LLM land.
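For anyone who hasn't looked inside GGUF, "4-bit grouped" boils down to something like the following (a toy numpy sketch of the idea, my own simplification rather than the real Q4 layouts, and nothing ONNX-specific):

  # Toy illustration of a grouped 4-bit scheme in the spirit of GGUF's Q4 types
  # (not the actual Q4_0/Q4_K bit layout): one fp16 scale per group of 32
  # weights, with the weights themselves stored as signed 4-bit values.
  import numpy as np

  GROUP = 32

  def quantize_q4(w):
      w = w.reshape(-1, GROUP)                           # one row per group
      scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
      scale[scale == 0] = 1.0                            # guard all-zero groups
      q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
      return q, scale.astype(np.float16)                 # a real format packs two 4-bit values per byte

  def dequantize_q4(q, scale):
      return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

  w = np.random.randn(4096).astype(np.float32)
  q, s = quantize_q4(w)
  print("mean abs error:", np.abs(w - dequantize_q4(q, s)).mean())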
> Windows ML is the built-in AI inferencing runtime optimized for on-device model inference...lets both new and experienced developers build AI-powered apps
This sounds equivalent to Apple's announcement last week about opening up access for any developer to tap into the on-device large language model at the core of Apple Intelligence [1].
No matter the device, this is a win-win for the developers making privacy-focused apps and the consumers getting them.
[1] https://www.apple.com/newsroom/2025/09/new-apple-intelligenc...
Thus the C#, C++ and Python support comes as WinRT projections on top of the new API.
This is good news; hopefully it comes with much better integration with the Nvidia and non-Nvidia AI/ML ecosystems, including the crucial drivers, firmware and toolkits, as discussed in this very recent HN posting [1].
[1] Docker Was Too Slow, So We Replaced It: Nix in Production [video]
https://news.ycombinator.com/item?id=45398468