mikhail-ramirez (u/mikhail-ramirez)

dust42 · 24 days ago

This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

Tech summary:

  - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
  - limited KV cache
  - 880mm^2 die, TSMC 6nm, 53B transistors
  - presumably 200W per chip
  - 20x cheaper to produce
  - 10x less energy per token for inference
  - max context size: flexible
  - mid-sized thinking model upcoming this spring on same hardware
  - next hardware supposed to be FP4 
  - a frontier LLM planned within twelve months

This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.

Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

Not exactly a competitor for Nvidia but probably for 5-10% of the market.

Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.

Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

mikhail-ramirez · 24 days ago

Yea its fast af but very quickly loses context/hallucinates from my own tests with large chunks of text