Yet I see other people with fewer resources, like 10GB of VRAM and 32GB of system RAM, fitting the 120B model onto their hardware.
Perhaps it's because ollama doesn't really support ROCm on the RDNA4 architecture yet? I believe I'm currently running on Vulkan, and it seems to use my CPU more than my GPU at the moment. Maybe I should just ask the model all this.
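One way to confirm which device is actually doing the work, rather than guessing: `ollama ps` reports the CPU/GPU split for a loaded model, and the server's startup log says which compute backend it detected. A quick sketch (the `journalctl` line assumes a systemd install on Linux where the service is named `ollama`):

```shell
# Shows loaded models and the weight placement, e.g. "100% GPU"
# or something like "43%/57% CPU/GPU" if layers spilled to system RAM.
ollama ps

# The server log at startup reports the detected GPU and the backend
# it chose; grep for the relevant keywords (assumes systemd service):
journalctl -u ollama --no-pager | grep -i -E 'rocm|vulkan|gpu' | tail -n 20
```

If `ollama ps` shows a large CPU share, that would explain the CPU-heavy behavior: whatever doesn't fit in VRAM gets offloaded to system memory and run on the CPU.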
I'm not complaining too much because it's still amazing I can run these models. I just like pushing the hardware to its limit.
Not a major setback, since for long context I'd just use GPT or Claude, but it would be cool to have 128k context locally on my machine. When I get a new CPU I'll upgrade to 64GB of RAM. My GPU is more than capable of what I need for a while, and a 5090 or 4090 is the next step up in VRAM, but I don't want to shell out $2k for a card.
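For a rough sense of what 128k context costs on top of the weights: the KV cache grows linearly with context length. A back-of-envelope sketch, assuming a GQA model shape in the ballpark of gpt-oss-120b (36 layers, 8 KV heads, head dim 64, fp16 cache entries — these numbers are assumptions, and sliding-window attention on some layers would shrink the real figure):

```shell
# Back-of-envelope KV-cache size at full 128k context.
# All model-shape numbers below are assumptions, not measured values.
layers=36
kv_heads=8
head_dim=64
bytes_per=2      # fp16 cache entries
tokens=131072    # 128k context

# Factor of 2 covers both the K and the V tensors per layer.
kv_bytes=$(( 2 * layers * kv_heads * head_dim * bytes_per * tokens ))
kv_gib=$(( kv_bytes / 1024 / 1024 / 1024 ))
echo "KV cache at 128k tokens: ~${kv_gib} GiB"
```

Under those assumptions the cache alone lands around 9 GiB, which is why long context is the thing that breaks first on a VRAM-limited setup — the weights may fit, but the cache for a long conversation won't.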