ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
Please use --jinja for llama.cpp, and set temperature = 0.7 and top-p = 0.95!
Also, it's best to increase Ollama's context length to at least 8K: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details are in https://docs.unsloth.ai/basics/magistral
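If you don't want to pass that environment variable every time, the same settings can be baked into an Ollama Modelfile. A minimal sketch, assuming the model has already been pulled under the hf.co tag above (the new model name is made up for illustration):
# Modelfile -- FROM reuses the tag pulled by the ollama run command above
FROM hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
Then create and run it:
ollama create magistral-small-tuned -f Modelfile
ollama run magistral-small-tuned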
I would love to open a workspace. Full stop.
However, due to the price of the shredder and the tools required to transform the plastic into new forms, one needs a dedicated space with a lot of power. Then you need to secure a source of plastic. You would think this part would be easy; I mean, that is the whole premise of this org's existence, right? You would be wrong in that assumption. There is big money in "recycling" in the US. From the collection, sorting, and distribution of recycled materials... someone already has a contract to legally "do it."
I am bummed to see them in this position. There seem to be a few hotspots around the world where this would really work. They aren't near me, that is for sure.
I meant when you download a GGUF file from Hugging Face, instead of using a model from Ollama's library.
https://github.com/ollama/ollama/blob/main/docs%2Fmodelfile....
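For the downloaded-GGUF case, the Modelfile just points FROM at the local file. A minimal sketch (the path and model name here are hypothetical):
# Modelfile for a hand-downloaded GGUF; replace the path with wherever you saved it
FROM /mnt/nvme/models/some-model-Q4_K_M.gguf
ollama create some-model-local -f Modelfile
ollama run some-model-local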
Here is my workflow when using Open WebUI:
1. ollama show qwen3:30b-a3b-q8_0 --modelfile
2. Paste the contents of the modelfile into Open WebUI -> Admin -> Models, and rename it qwen3:30b-a3b-q8_0-monkversion-1
3. Change parameters, e.g. num_gpu 90 to change how many layers are offloaded to the GPU... etc.
4. Keep or delete the old model
Pay attention to the modelfile: it will show you something like "# To build a new Modelfile based on this, replace FROM with: # FROM qwen3:30b-a3b-q8_0", and you need to make sure the paths are correct. As an example of why that matters, I store my models on a large NVMe drive that isn't Ollama's default location.
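A rough CLI equivalent of steps 1-4, using the same model names as above (the num_gpu value is just an example):
ollama show qwen3:30b-a3b-q8_0 --modelfile > Modelfile
# edit Modelfile: keep "FROM qwen3:30b-a3b-q8_0" and add/adjust "PARAMETER num_gpu 90"
ollama create qwen3:30b-a3b-q8_0-monkversion-1 -f Modelfile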
EDIT TO ADD: The 'modelfile' workflow is a pain in the booty. It's a dogwater pattern and I hate it. Some of these models are 30 to 60GB and copying the entire thing to change one parameter is just dumb.
However, ollama does a lot of things right, and it makes it easy to get up and running. vLLM, SGLang, mistral.rs, and even llama.cpp require a lot more work to set up.
A wallet that only lasts 10 years seems disposable at this point.
My teens use these little things that attach to their phones to hold gym key, debit card and ID.
I use a traditional "wallet" or billfold as my abuelo used to call them, but I am positively a dinosaur using one. Also, the darn thing hurts my back if I leave it in my back pocket while driving/sitting.
Heck, I have been eyeing those crossbody bags, or sacoches, to hold the things that are in my wallet.
There are a bazillion and one hardware combinations where even RAM timings can make a difference. Offloading a small portion to a GPU can make a HUGE difference. Some engines have been optimized to run on Pascal with CUDA compute capability below 7.0, and some have tricks for newer-generation cards with modern CUDA. Some engines only run on Linux while others are completely cross-platform. It is truly the wild west of combinatorics as they relate to hardware and software. It is bewildering, to say the least.
In other words, there is no clear "best" outside of a DGX and Linux software stack. The only way to know anything right now is to test and optimize for what you want to accomplish by running a local LLM.
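As one concrete version of that test-and-optimize loop, llama.cpp's llama-bench can sweep GPU offload counts in a single run and report throughput for each. A sketch, with a hypothetical model path:
# compare throughput at 0, 20, 40, and 99 layers offloaded to the GPU
./llama.cpp/llama-bench -m /path/to/model.gguf -ngl 0,20,40,99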
As a learning project, copy a successful project: mind map it, program it, and find customers. You will learn so much that you can apply to whatever project you come up with. You'll get frustrated because you will need to go out and learn a lot, but power through it. You'll be better off for the hard work. Good luck!