OP says “I taught LLM how to see”, and that should mean the LLM (something capable of being taught, of learning) internalized how to see. It did not; it was given a tool that does the seeing and tells it what things are.
People are very interested in good local LLMs with vision integrated, so they want to read about it. Next to nobody would click on the honest “I enabled an LLM to use a Google service to identify objects in images”, which is what OP actually did.
I'm under the impression I'm being hampered by a separation of 'brain' and 'eyes': I have yet to find a local reasoning + vision model that fits on my Mac, and I've played with two instances of Qwen (one vision, one reasoning) chained together to try to work around it, but no real breakthroughs yet (a rough sketch of that split is below). The requirements I've given myself are fully local models, and no reading data from the ROM that the human player cannot be aware of.
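For concreteness, here is a minimal sketch of that 'eyes' + 'brain' split, assuming Ollama is serving both models locally; the model tags, prompts, and file names are my own placeholders, not details from the actual setup.

```python
# Minimal sketch of the two-model split: a vision model narrates the screen,
# a reasoning model decides on an action from the text alone.
# Assumes Ollama is running locally with both models already pulled;
# the model tags below are placeholders, swap in whatever pair you have.
import ollama

VISION_MODEL = "qwen2.5vl"   # assumed tag for a local Qwen vision model
REASONING_MODEL = "qwen3"    # assumed tag for a local Qwen reasoning model


def describe_screen(screenshot_path: str) -> str:
    """Ask the vision model to narrate what is on screen, nothing more."""
    response = ollama.chat(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": "Describe everything visible on this game screen, "
                       "factually and without suggesting actions.",
            "images": [screenshot_path],
        }],
    )
    return response["message"]["content"]


def decide_action(description: str) -> str:
    """Feed the text description to the reasoning model and ask for a move."""
    response = ollama.chat(
        model=REASONING_MODEL,
        messages=[{
            "role": "user",
            "content": f"Game screen description:\n{description}\n\n"
                       "Decide the single best button press and explain why.",
        }],
    )
    return response["message"]["content"]


if __name__ == "__main__":
    seen = describe_screen("screenshot.png")  # placeholder emulator capture
    print(decide_action(seen))
```

The structural weakness is visible right in the interface: the reasoning model only ever knows what the vision model chose to put into words, so anything the describer omits is invisible to the 'brain', which may be exactly the hampering described above.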
I was hoping OP had managed to retrofit vision onto blind models, not just offload it to a cloud model. It's still an interesting write-up, but I for sure got click-baited.
"Nature Show Host": not David Attenborough, surprisingly
"Compelling Lady": nothing beats a Jet2 Holiday
"Upset Girl": this is more the voiceover that would be used on depressing animal charity adverts
"Magnetic Man": you can't fool me, that's an American
"Patient Man": patience gives you reverb. The word "British" is spoken with a very non-British accent.
Not to be all Henry Higgins, but these are all "placeless" accents and there are no regional accent options. I was looking forward to trying Computer Mancunian. But I can see why, for marketing voiceover, people want "global neutral British".
UX review: "failed to generate speech". Only the example phrases work.
Kettering accent generator when :(