Even using llama.cpp as a library seems like overkill for most use cases. Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it.
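For what it's worth, here's a minimal Go sketch of that subprocess idea: spawn llama-server as a child process and reverse-proxy incoming requests to it. I've used a localhost TCP port rather than a unix socket for simplicity (whether your llama-server build can bind a unix socket is an assumption you'd need to check), and the model path, ports, and flags are just placeholders:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os/exec"
)

func main() {
	// Spawn llama-server as a child process (binary path, model path,
	// and port are illustrative; adjust to your setup).
	cmd := exec.Command("llama-server",
		"--model", "/models/example.gguf",
		"--port", "8081")
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start llama-server: %v", err)
	}
	defer cmd.Process.Kill()

	// Reverse-proxy every incoming request to the child server.
	// (A real version would wait for the child to become ready first.)
	backend, _ := url.Parse("http://127.0.0.1:8081")
	proxy := httputil.NewSingleHostReverseProxy(backend)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", proxy))
}
```

That's basically all the "engine integration" most deployments need, and the child process can be restarted or swapped independently of the frontend.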
One thing I'm curious about: does Ollama support strict structured output or strict tool calls that adhere to a JSON schema? It would be insane to rely on a server for agentic use unless that server can guarantee the model will only produce valid JSON. AFAIK this feature is implemented by llama.cpp, which they no longer use.
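For reference, this is what that guarantee looks like on the llama.cpp side: llama-server can take a JSON schema, compile it into a grammar, and constrain sampling so the model can only emit schema-valid JSON. A rough Go sketch of a request against its /completion endpoint (field names and endpoint are from my reading of the llama.cpp server docs, so double-check them against your build; the port and schema are made up):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// "json_schema" asks the server to constrain generation to this shape.
	body, _ := json.Marshal(map[string]any{
		"prompt": "Extract the city and temperature from: 'It is 21C in Oslo.'\n",
		"json_schema": map[string]any{
			"type": "object",
			"properties": map[string]any{
				"city":        map[string]any{"type": "string"},
				"temperature": map[string]any{"type": "number"},
			},
			"required": []string{"city", "temperature"},
		},
		"n_predict": 128,
	})

	resp, err := http.Post("http://127.0.0.1:8080/completion",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // response content is guaranteed to parse as the schema
}
```

Whether Ollama exposes an equivalent hard constraint now that it has its own engine is exactly what I'd want confirmed before pointing an agent at it.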
It makes their VCs think they're doing more and have more ownership, rather than being a do-nothing wrapper with some analytics and S3 buckets that rehost models from HF.
You: "I'm sorry, I don't have an encyclopedia."
I'm starting to think you're 270M.