Edit: it seems like this is likely one chip and not 10. I assumed 8B 16bit quant with 4K or more context. This made me think that they must have chained multiple chips together since N6 850mm2 chip would only yield 3GB of SRAM max. Instead, they seem to have etched llama 8B q3 with 1k context instead which would indeed fit the chip size.
This requires 10 chips for an 8 billion q3 param model. 2.4kW.
Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?
This requires 10 chips for an 8 billion q3 param model. 2.4kW.
10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.
Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?