phi4-mini-reasoning took the same prompt and bailed out because (at least according to its trace) it interpreted it as meaning "can't have a, e, i, o, or u in the name".
Local is the only inference paradigm I'm interested in, but these things have a way to go.
The tool tells the agent to ask the user for it, and the agent cannot proceed without it. The instructions from the tool show an all-caps message explaining the risk and telling the agent that it must prompt the user for the OTP.
I haven't used any of the *Claws yet, but this seems like an essential poor man's human-in-the-loop implementation that may help prevent some pain.
I prefer to make my own agent CLIs for everything, for reasons like this and many others: full control over what the tool may do, and the ability to make it more useful.
* Many top-quality TTS and STT models
* Image recognition, object tracking
* Speculative decoding, attached to a much bigger model (big/small architecture?)
* An agentic loop trying 20 different approaches/algorithms, then picking the best one
* Edited to add: put 50 such small models together to create a SOTA, super-fast model
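The "try N approaches, pick the best" bullet above is easy to sketch as a harness (a toy illustration, not from the thread): run every candidate on the same input and keep the fastest one that returns the correct answer.

```python
import time
from functools import reduce

def best_of_n(candidates, test_input, expected):
    """Run every candidate on the same input; keep the fastest correct one."""
    best, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        result = fn(test_input)
        elapsed = time.perf_counter() - start
        if result == expected and elapsed < best_time:
            best, best_time = fn, elapsed
    return best

def loop_sum(xs):
    # Naive hand-rolled alternative to the built-in.
    total = 0
    for x in xs:
        total += x
    return total

# Three candidate "approaches" to the same trivial task: summing a list.
approaches = [sum, loop_sum, lambda xs: reduce(lambda a, b: a + b, xs, 0)]
winner = best_of_n(approaches, list(range(1000)), 499500)
```

In a real agentic loop the candidates would be generated programs and the scoring would be a benchmark plus a correctness check, but the select-the-best skeleton is the same.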
Tech summary:
- 15k tok/sec on an 8B dense 3-bit quant (Llama 3.1)
- limited KV cache
- 880mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of career experience across AMD, Nvidia, and others, with $200M in VC funding so far. Certainly interesting for very low-latency applications that need < 10k tokens of context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back of the napkin: the cost for 1mm^2 of 6nm wafer is ~$0.20, so 1B parameters need about $20 of die area. The larger the die, the lower the yield. Supposedly the inference speed stays almost the same with larger models.
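Spelling out the napkin math, using only the figures already quoted (~$0.20/mm^2, an 880 mm^2 die, 8B parameters):

```python
# Figures from the comment above: ~$0.20 per mm^2 of 6nm wafer,
# an 880 mm^2 die holding an 8B-parameter model.
cost_per_mm2 = 0.20
die_area_mm2 = 880
params_billion = 8

die_cost = cost_per_mm2 * die_area_mm2        # ~$176 per die
cost_per_billion = die_cost / params_billion  # ~$22 per 1B parameters
```

So the "~$20 of die per 1B parameters" claim works out to about $22 with these numbers, before accounting for yield losses on such a large die.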
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
* By training on user data, you can collect images of specific aircraft models and then train a classifier for the airplane model. It might require another model, where only the bbox is the input, together with distance/calculated measurements of the object ("pixel size") and the orientation of the plane (side? front? belly?).
* Provide alerts/notifications for special aircraft like helicopters, military planes, Air Force One, etc.
* When a bbox is detected, you can run super-resolution upscaling on the photo/stream of images
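As a toy illustration of the "pixel size" idea above (all camera parameters here are hypothetical, not from the comment): under a pinhole-camera model, an object's physical width can be estimated from its bbox width in pixels, its distance, and the camera's focal length expressed in pixels.

```python
def estimate_wingspan_m(bbox_width_px, distance_m, focal_length_px):
    """Pinhole model: real_width = pixel_width * distance / focal_length_px."""
    return bbox_width_px * distance_m / focal_length_px

# Hypothetical numbers: a 300 px wide bbox at 2000 m with a 15000 px
# focal length suggests roughly a 40 m wingspan (wide-body territory).
span = estimate_wingspan_m(300, 2000, 15000)
```

An estimate like this, fed alongside the bbox crop and orientation into a second classifier, is one way to disambiguate aircraft types that look similar at a distance.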
First place I worked right out of college had a big training seminar for new hires. One day we were told the story of how they'd improved load times from around 5 minutes to 30 seconds; this improvement was in the mid '90s. The negative responses from clients were instant. The load time improvements had destroyed their company culture. Instead of everyone coming into the office, turning on their computers, and spending the next 10 minutes chatting and drinking coffee, the software was ready before they'd even stood up from their desks!
The moral of the story, and the quote, isn't that you shouldn't improve things. Instead it's a reminder that the software you're building doesn't exist in a PRD or a test suite. It's a system that people will interact with out there in the world. Habits will form, workarounds will be developed, bugs will be leaned on for actual use cases.
This makes it critically important that you, the software engineer, understand the purpose and real-world usage of your software. Your job isn't to complete tickets that fulfill a list of asks from your product manager. Your job is to build software that solves users' problems.
First: cases where expected, old, and new outputs are all identical should go to the regression suite. Then a HUMAN should take a look, in this order: 1. Cases where expected and old are identical, but Rust is different. 2. If time allows, cases where expected and Rust are identical, but old is different.
TBH, after #1 (expected+old vs. Rust) I'd be asking the GenAI to generate more test cases in these faulty areas.
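The triage order described above can be written down as a small sorting function. A sketch, assuming each test case carries its expected output plus the old implementation's and the Rust port's outputs (cases where neither matches go into a leftover bucket, which the comment doesn't mention explicitly):

```python
def triage(cases):
    """Split cases per the comment: all-agree -> regression suite; then
    prioritize human review where old matches expected but Rust differs,
    then where Rust matches expected but old differs."""
    regression, review_first, review_second, other = [], [], [], []
    for c in cases:
        if c["expected"] == c["old"] == c["rust"]:
            regression.append(c)       # all three identical: keep as regression test
        elif c["expected"] == c["old"]:
            review_first.append(c)     # Rust port likely introduced a bug
        elif c["expected"] == c["rust"]:
            review_second.append(c)    # Rust may have fixed an old bug
        else:
            other.append(c)            # nobody matches expected
    return regression, review_first, review_second, other

cases = [
    {"id": 1, "expected": "a", "old": "a", "rust": "a"},
    {"id": 2, "expected": "a", "old": "a", "rust": "b"},
    {"id": 3, "expected": "a", "old": "b", "rust": "a"},
]
reg, first, second, other = triage(cases)
```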
Worth mentioning in the title that it's CPU-only: >1200 tokens/s on a single thread is impressive.
Have you considered doing optimization iterations like nanogpt-speedrun? Would be interesting to see how far you can push the performance.