dust42 · 22 days ago
This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

Tech summary:

  - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
  - limited KV cache
  - 880mm^2 die, TSMC 6nm, 53B transistors
  - presumably 200W per chip
  - 20x cheaper to produce
  - 10x less energy per token for inference
  - max context size: flexible
  - mid-sized thinking model upcoming this spring on same hardware
  - next hardware supposed to be FP4 
  - a frontier LLM planned within twelve months
This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.

Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

Not exactly a competitor for Nvidia but probably for 5-10% of the market.

Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
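For what it's worth, the napkin math above checks out arithmetically. A quick sketch (the $0.20/mm² wafer cost and 3-bit density are the comment's rough assumptions, not vendor figures):

```python
# Back-of-napkin die cost using the rough figures quoted above.
# All numbers are assumptions from the comment, not vendor data.
wafer_cost_per_mm2 = 0.20          # ~$ per mm^2 of TSMC 6nm wafer
die_area_mm2 = 880                 # reported die size
die_cost = wafer_cost_per_mm2 * die_area_mm2
print(f"Raw silicon cost per die: ${die_cost:.0f}")   # ignoring yield

# 8B params at 3-bit quant etched across the die:
params = 8e9
bits_per_param = 3
model_gb = params * bits_per_param / 8 / 1e9
print(f"Model size: {model_gb:.0f} GB")
```

That lines up with the "$20 per 1B parameters" rule of thumb (8B ≈ $160-180 of raw silicon, before yield losses).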

Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

vessenes · 21 days ago
This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing ~12k tokens/second on Llama 2 13B fp8. Knowing these architectures, that's likely a 100+ batched run, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is in the milliseconds.

3) Jensen has these Pareto curve graphs: for a given energy budget and chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. A design on the 6nm process is likely 30-40% bigger than on 4nm and draws that much more power; if we take the numbers they give and extrapolate to an fp8 model (slower) on a smaller geometry (30% faster and lower power), and compare 16k tokens/second for Taalas to 12k tokens/s for an H200, these chips are in the same ballpark on the curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

rbanffy · 21 days ago
> any factor of 10 being a new science / new product category,

I often remind people that two orders of magnitude of quantitative change is a qualitative change.

> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.

I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.

Gareth321 · 21 days ago
I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.
ssivark · 21 days ago
> speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious

Can we use older (previous generation, smaller) models as a speculative decoder for the current model? I don't know whether the randomness in training (weight init, data ordering, etc) will affect this kind of use. To the extent that these models are learning the "true underlying token distribution" this should be possible, in principle. If that's the case, speculative decoding is an elegant vector to introduce this kind of tech, and the turnaround time is even less of a problem.

btown · 21 days ago
For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?
jasonwatkinspdx · 20 days ago
> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

They may be using Rapidus, which is a Japanese government backed foundry built around all single wafer processing vs traditional batching. They advertise ~2 month turnaround time as standard, and as short as 2 weeks for priority.

empath75 · 21 days ago
Think about this for solving questions in math where you need to explore a search space. You could run 100 of these for the same cost and time as one API call to OpenAI.
joha4270 · 21 days ago
The guts of a LLM isn't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?
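The detail being asked about: the big model can verify a whole run of draft tokens in a single parallel forward pass, which costs roughly the same as generating one token. A toy sketch with characters standing in for tokens (scripted stub "models", not a real inference API):

```python
# Toy illustration of speculative decoding. The draft model is cheap per
# token; the big model can check a whole window of proposals at once.
class ToyModel:
    def __init__(self, text):
        self.text = text                        # sequence it "believes" in
    def propose(self, pos, k):
        return self.text[pos:pos + k]           # k cheap draft tokens
    def accept_prefix(self, pos, draft):
        # One "parallel forward pass": score all k draft tokens at once
        # and keep the longest prefix the big model agrees with.
        n = 0
        for a, b in zip(draft, self.text[pos:]):
            if a != b:
                break
            n += 1
        return draft[:n]
    def next_token(self, pos):
        return self.text[pos]                   # one expensive real step

big = ToyModel("the quick brown fox jumps")
small = ToyModel("the quick brown cat napping now")  # diverges at 'cat'

out, big_calls = "", 0
while len(out) < len(big.text):
    draft = small.propose(len(out), 4)
    accepted = big.accept_prefix(len(out), draft)
    big_calls += 1
    out += accepted
    if len(accepted) < len(draft) and len(out) < len(big.text):
        out += big.next_token(len(out))         # divergence: one real step
        big_calls += 1

print(out)        # identical to the big model's own output
print(big_calls)  # fewer big-model calls than output tokens
```

So you never wait for the big model token-by-token: while the models agree, each big-model call covers several tokens at once, and the output is still exactly what the big model would have produced.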

soleveloper · 21 days ago
At $20 a die, they could sell Game Boy-style cartridges for different models.
twalla · 21 days ago
Okay, now _this_ is the cyberpunk future I asked for.
noveltyaccount · 21 days ago
That would be very cool, get an upgraded model every couple of months. Maybe PCIe form factor.
pennomi · 21 days ago
Make them shaped like floppy disks to confuse the younger generations.
fennecbutt · 20 days ago
Microsoft
merlindru · 21 days ago
dude that would be so incredibly cool
alexjplant · 20 days ago
Most importantly this opens up an amazing future where we get the real version of the classic science fiction MacGuffin of a physical AI chip. Pair this with several TB of flash storage and you have persistent artificial consciousness that can be carried around with you. Bonus points if it's quirky, custom-trained and the chip is one of a kind that you stole from an evil corporation. Additional bonus points if the packaging is such that it's small enough to plug into the USB-C port on your smart glasses and has an eBPF module it can leverage to see what you're doing and talk to you in real time about your actions.

I enjoy envisioning futures more whimsical than "the bargain-basement LLM provider that my insurance company uses denied my claim because I chose badly-vectored words".

jameslk · 21 days ago
> Certainly interesting for very low latency applications which need < 10k tokens context.

I'm really curious whether context limits will really matter if you use methods like Recursive Language Models[0]. That method is suited to breaking a huge amount of context down into smaller subagent tasks recursively, each working on a symbolic subset of the prompt.

The challenge with RLM seemed to be that it burned through a ton of tokens in exchange for more accuracy. If tokens are cheap, RLM could be beneficial here, providing much more accuracy over large contexts regardless of what the underlying model can handle.

0. https://arxiv.org/abs/2512.24601
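The recursive split-and-summarize pattern described above can be sketched roughly like this (the `ask_model` callable is a hypothetical stand-in for any fast inference endpoint, not the paper's actual API):

```python
# Hypothetical sketch of a recursive reduce over a huge context:
# split into chunks, answer over each group with a fast model, and
# recurse on the partial answers until everything fits in one prompt.
def recursive_reduce(question, chunks, ask_model, fan_in=8):
    if len(chunks) == 1:
        return ask_model(f"{question}\n\nContext:\n{chunks[0]}")
    groups = [chunks[i:i + fan_in] for i in range(0, len(chunks), fan_in)]
    # Cheap, fast tokens are what make this brute-force tree affordable.
    partials = [
        ask_model(f"Extract anything relevant to: {question}\n\n"
                  + "\n---\n".join(group))
        for group in groups
    ]
    return recursive_reduce(question, partials, ask_model, fan_in)
```

Token usage grows with the size of the tree, which is exactly why cheap per-token cost changes the calculus.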

aurareturn · 22 days ago
Don’t forget that the 8B model requires 10 of said chips to run.

And it's a 3-bit quant, so a 3GB RAM requirement.

If they ran 8B at native 16-bit precision, it would use 60 H100-sized chips.

dust42 · 22 days ago
> Don’t forget that the 8B model requires 10 of said chips to run.

Are you sure about that? If true it would definitely make it look a lot less interesting.

elternal_love · 22 days ago
This is where we go toward really smart robots. It is interesting what kinds of different model chips they can produce.
varispeed · 22 days ago
There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.
Aissen · 21 days ago
> 880mm^2 die

That's a lot of surface, isn't it? As big as an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).

> The larger the die size, the lower the yield.

I wonder if that applies? What's the big deal if a few parameters have a few bit flips?

rbanffy · 21 days ago
> I wonder if that applies? What's the big deal if a few parameters have a few bit flips?

We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.

Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe

empath75 · 21 days ago
An on-device reasoning model with that kind of speed and cost would completely change the way people use their computers. It would be closer to Star Trek than anything else we've ever had. You'd never have to type anything or use a mouse again.
xnx · 21 days ago
Hardware decoders make sense for fixed codecs like MPEG, but I can't see it making sense for small models that improve every 6 months.
WhitneyLand · 21 days ago
There's a bit of a hidden cost here… GPU hardware has greater longevity; its useful life is extended every time there's an algorithmic improvement. Any efficiency gains in software that are not compatible with this hardware will tend to accelerate its depreciation.
gwern · 21 days ago
K-V caches are large, but hidden states aren't necessarily that large. And if you can run a model once ridiculously fast, then you can loop it repeatedly and still be fast. So I wonder about the 'modern RNNs' like RWKV here...
make3 · 20 days ago
It's weird to me to train such huge models and then degrade them by quantizing presumably 16-bit (bfloat16) weights down to 3 bits. Why not just train smaller models then?
pankajdoharey · 21 days ago
There is nothing new here. This has been demonstrated several times by previous researchers:

https://arxiv.org/abs/2511.06174

https://arxiv.org/abs/2401.03868

For a real-world use case, you would need an FPGA with terabytes of RAM. Perhaps it'll be off-chip HBM. But for large models, even that won't be enough. Then you would need to figure out an NVLink-like interconnect for these FPGAs. And we are back to square one.

smokel · 20 days ago
This is new. You are citing FPGA prototypes. Those papers do not demonstrate the same class of scaling or hardware integration that Taalas is advocating. For one, the FPGA solutions typically use fixed multipliers (or lookup tables), the ASIC solution has more freedom to optimize routing for 4 bit multiplication.
bsenftner · 21 days ago
Do not overlook traditional irrational investor exuberance; we've got an abundance of that right now. With the right PR maneuvers these guys could be a tulip craze.
oliwary · 22 days ago
This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.
pulse7 · 20 days ago
Maybe they can stack LLM parameters in 200 layers like 3D NAND flash and make the chip very small ...
mikhail-ramirez · 21 days ago
Yea it's fast af, but in my own tests with large chunks of text it very quickly loses context/hallucinates.
Tepix · 21 days ago
Doesn't the blog state that it's now 4bit (the first gen was 3bit + 6bit)?
robotnikman · 21 days ago
Sounds perfect for use in consumer devices.
zozbot234 · 21 days ago
Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.
dust42 · 21 days ago
Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.
PhunkyPhil · 21 days ago
I think it's really useful for agent to agent communication, as long as context loading doesn't become a bottleneck. Right now there can be noticeable delays under the hood, but at these speeds we'll never have to worry about latency when chain calling hundreds or thousands of agents in a network (I'm presuming this is going to take off in the future). Correct me if I'm wrong though.
Alifatisk · 21 days ago
What's happening in the comment section? How come so many cannot understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B param model; why are people expecting to see Opus-level responses!?

The focus here should be on the custom hardware they are producing and its performance; that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.

This reminds me a lot of when I tried the Mercury coder model by Inception Labs; they are creating something called a dLLM, a diffusion-based LLM. Its speed is still impressive when playing around with it sometimes. But this, this is something else; it's almost unbelievable. As soon as I hit the enter key, the response appears. It feels instant.

I am also curious about Taalas pricing.

> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

Do we have an idea of how much a unit / inference / api will cost?

Also, considering how fast people switch models to keep up with the pace: is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw out the current hardware and buy another? Shouldn't there be a more flexible way? Maybe only swapping the chip on top, like how people upgrade CPUs. I don't know, just thinking out loud.

mike_hearn · 21 days ago
They don't give cost figures in their blog post but they do here:

https://www.nextplatform.com/wp-content/uploads/2026/02/taal...

Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.

> Is there really a potential market for hardware designed for one model only?

I'm sure there is. Models are largely interchangeable, especially at the low end. There are lots of use cases where you don't need super smart models but cheapness and fastness can matter a lot.

Think about a simple use case: a company has a list of one million customer names but no information about gender or age. They'd like to get a rough understanding of this. Mapping name -> guessed gender, rough guess of age is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.
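That name-classification job is embarrassingly parallel; a hedged sketch of how it might look (the `query_llm` callable and the prompt are illustrative stand-ins, not a real API):

```python
# Sketch of the name -> (gender, age) batch job described above.
# `query_llm` is a placeholder for whatever fast inference endpoint
# is in use; with cheap tokens the per-row latency is what matters.
from concurrent.futures import ThreadPoolExecutor

PROMPT = ("Guess the likely gender (M/F/unknown) and age range for the "
          "first name '{name}'. Reply as 'gender,age-range' only.")

def classify_names(names, query_llm, workers=64):
    # Each call is tiny (short prompt, short answer), so wall-clock time
    # is dominated by per-request latency -- exactly where a
    # milliseconds-per-response chip pays off.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = pool.map(lambda n: query_llm(PROMPT.format(name=n)), names)
        return dict(zip(names, answers))
```

A million rows at a few milliseconds per answer, fanned out across workers, is minutes of wall-clock time rather than days.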

> Shouldn't there be a more flexible way?

The whole point of their design is to sacrifice flexibility for speed, although they claim they support fine tunes via LoRAs. LLMs are already supremely flexible so it probably doesn't matter.

pigpop · 21 days ago
Yes, there are all kinds of fuzzy NLP tasks that this would be great for. Jobs where you can chunk the text into small units and add instructions and only need a short response. You could burn through huge data sets very quickly using these chips.
himata4113 · 21 days ago
I personally don't buy it; Cerebras is way more advanced than this, and comparing this tok/s to Cerebras is disingenuous.
alfalfasprout · 21 days ago
Cerebras is a totally different product though. They can (theoretically) run any frontier model provided it gets compiled a certain way. Like a wafer scale TPU.

This is using hardwired weights with on-die SRAM used for K/V for example. It's WAY more power efficient and faster. The tradeoff being it's hardwired.

Still, most frontier models are "good enough" where an obscenely fast version would be a major seller.

test001only · 20 days ago
That is my concern too. A chip optimised for a model or specific model architecture will not be useful for long.
ahofmann · 20 days ago
I just tried the demo and I think this is huge! If they manage to build a chip in 2 or 3 years that can run something like Opus 4.6, or even Sonnet, at that speed, the disruption in the world of software development will be bigger than what we saw in the last 3-5 years. LLMs today are somewhat useful, but they are still too slow and expensive for a meaningful Ralph loop. Being able to run those loops (or "thinking", if you want to call it that) much faster will enable a lot of stuff that is not feasible today. Writing things like OpenClaw will not take weeks, but hours. Maybe even rewriting entire tools, kernels, or OSes will be feasible because the LLM can run through almost endless tries.

Speed and cost wins over quality and this will also be true for LLMs.

Herring · 21 days ago
If it's so easy to do custom silicon for any model (they say only 2 months), why didn't they demo one of the newer DeepSeek models instead? Using a 2-year-old model looks so bad. I'm not buying it.
robotpepi · 21 days ago
They explain it in the article: this is the first iteration, so they wanted to start with something simple, i.e., this is a tech demo.
real-hacker · 20 days ago
They support LoRA, which is something.
freakynit · 21 days ago
Holy cow, their chat app demo!!! For the first time, I thought I had mistakenly pasted the answer. It was literally the blink of an eye!!

https://chatjimmy.ai/

qingcharles · 21 days ago
I asked it to design a submarine for my cat and literally the instant my finger touched return the answer was there. And that is factoring in the round-trip time for the data too. Crazy.

The answer wasn't dumb like others are getting. It was pretty comprehensive and useful.

  While the idea of a feline submarine is adorable, please be aware that building a real submarine requires significant expertise, specialized equipment, and resources.

robotpepi · 21 days ago
it's incredible how many people are commenting here without having read the article. they completely lost the point.
smusamashah · 21 days ago
With this speed, you can keep looping and generating code until it passes all tests. If you have tests.

Generate lots of solutions and mix and match. This allows a new way to look at LLMs.
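A minimal sketch of that loop (`generate_code` and `run_tests` are hypothetical callables, not any particular framework):

```python
# Sketch of "loop until the tests pass": at 15k tok/s a full candidate
# solution costs milliseconds, so brute-force retries become practical.
def solve(spec, generate_code, run_tests, max_tries=100):
    candidates = []
    for i in range(max_tries):
        # Vary the seed so each retry samples a different completion
        # instead of replaying the same one.
        code = generate_code(spec, seed=i)
        passed, failures = run_tests(code)
        if passed:
            return code
        candidates.append((code, failures))
    return candidates   # nothing passed: keep partials for mix-and-match
```

The quality ceiling is still set by the model and the tests, but the cost of each attempt stops being the bottleneck.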

Retr0id · 21 days ago
Not just looping, you could do a parallel graph search of the solution-space until you hit one that works.
turnsout · 21 days ago
Agreed, this is exciting, and has me thinking about completely different orchestrator patterns. You could begin to approach the solution space much more like a traditional optimization strategy such as CMA-ES. Rather than expect the first answer to be correct, you diverge wildly before converging.
Epskampie · 21 days ago
And then it's slow again to finally find a correct answer...
MattRix · 21 days ago
This is what people already do with “ralph” loops using the top coding models. It’s slow relative to this, but still very fast compared to hand-coding.
otabdeveloper4 · 21 days ago
This doesn't work. The model outputs the most probable tokens. Running it again and asking for less probable tokens just results in the same but with more errors.
amelius · 21 days ago
OK investors, time to pull out of OpenAI and move all your money to ChatJimmy.
freakynit · 21 days ago
A related argument I raised a few days back on HN:

What's the moat with these giant data centers being built for hundreds of billions of dollars on Nvidia chips?

If such chips can be built so easily, and offer this insane level of performance at 10x efficiency, then one thing is 100% sure: more such startups are coming... and with that, an entire new ecosystem.

raincole · 21 days ago
You mean Nvidia?
rstuart4133 · 21 days ago
> It was literally in a blink of an eye.!!

It's not even close. It takes the eye 100ms to 400ms to blink. This thing takes under 30ms to process a small query: about 10 times faster.

zwaps · 21 days ago
I got 16,000 tokens per second ahaha
gwd · 21 days ago
I dunno, it pretty quickly got stuck; the "attach file" didn't seem to work, and when I asked "can you see the attachment" it replied to my first message rather than my question.
scosman · 21 days ago
It’s llama 3.1 8B. No vision, not smart. It’s just a technical demo.
freakynit · 21 days ago
Hmm.. I had tried a simple chat conversation without file attachments.
PlatoIsADisease · 21 days ago
Well it got all 10 incorrect when I asked for top 10 catchphrases from a character in Plato's books. It confused the baddie for Socrates.
Rudybega · 21 days ago
Well yeah, they're running a small, outdated, older model. That's not really the point. This approach can be used for better, larger, newer models.
bsenftner · 21 days ago
I get nothing, no replies to anything.
freakynit · 21 days ago
Maybe hn and reddit crowd have overloaded them lol
elliotbnvl · 21 days ago
That… what…
b0ner_t0ner · 21 days ago
I asked, “What are the newest restaurants in New York City?”

Jimmy replied with, “2022 and 2023 openings:”

0_0

freakynit · 21 days ago
Well, technically its answer is correct when you consider its knowledge cutoff date... it just gave you a generic always-right answer :)
xi_studio · 21 days ago
ChatJimmy runs Llama 3.1.
jvidalv · 21 days ago
It's super fast but also super inaccurate; I would say not even GPT-3 level.
roywiggins · 21 days ago
That's because it's llama3 8b.
empath75 · 21 days ago
There are a lot of people here that are completely missing the point. What is it called when you look at a single point in time and judge an idea without seemingly being able to imagine 5 seconds into the future?
Etheryte · 21 days ago
It is incredibly fast, on that I agree, but even simple queries I tried got very inaccurate answers. Which makes sense, it's essentially a trade off of how much time you give it to "think", but if it's fast to the point where it has no accuracy, I'm not sure I see the appeal.
andrewdea · 21 days ago
The hardwired model is Llama 3.1 8B, a lightweight model from two years ago. Unlike newer models, it doesn't use "reasoning": the time between question and answer is spent predicting the next tokens. It doesn't run faster because it uses less time to "think"; it runs faster because its weights are hardwired into the chip rather than loaded from memory. A larger model on a larger hardwired chip would run about as fast and give far more accurate results. That's what this proof of concept shows.
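The weights-from-memory point can be put in rough numbers. Assuming published ballpark specs (not figures from the article):

```python
# Why single-stream decoding is memory-bound on a GPU: every generated
# token must stream all the weights through the compute units once.
params = 8e9
bytes_per_param = 2                    # fp16/bf16
weights_bytes = params * bytes_per_param        # 16 GB streamed per token

hbm_bandwidth = 3.35e12                # H100 SXM: ~3.35 TB/s HBM3
max_tok_per_s = hbm_bandwidth / weights_bytes
print(f"Bandwidth-bound ceiling: ~{max_tok_per_s:.0f} tok/s per stream")
```

With weights etched into the die there is no weight traffic at all, which is how a single user can see 15k+ tok/s instead of a few hundred.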
kaashif · 21 days ago
If it's incredibly fast at a 2022 state of the art level of accuracy, then surely it's only a matter of time until it's incredibly fast at a 2026 level of accuracy.
scotty79 · 21 days ago
I think it might be pretty good for translation. Especially when fed with small chunks of the content at a time so it doesn't lose track on longer texts.
rvz · 21 days ago
Fast, but stupid.

   Me: "How many r's in strawberry?"

   Jimmy: There are 2 r's in "strawberry".

   Generated in 0.001s • 17,825 tok/s
The question is not about how fast it is. The real question(s) are:

   1. How is this worth it over diffusion LLMs (No mention of diffusion LLMs at all in this thread)
(This also assumes that diffusion LLMs will get faster)

   2. Will Taalas also work with reasoning models, especially those that are beyond 100B parameters and with the output being correct? 

   3. How long will it take for newer models to be turned into silicon? (This industry moves faster than Taalas.)

   4. How does this work when one needs to fine-tune the model, but still benefit from the speed advantages?

mike_hearn · 21 days ago
The blog answers all those questions. It says they're working on fabbing a reasoning model this summer. It also says how long they think they need to fab new models, and that the chips support LoRAs and tweaking context window size.

I don't get these posts about ChatJimmy's intelligence. It's a heavily quantized Llama 3, using a custom quantization scheme because that was state of the art when they started. They claim they can update quickly (so I wonder why they didn't wait a few more months tbh and fab a newer model). Llama 3 wasn't very smart but so what, a lot of LLM use cases don't need smart, they need fast and cheap.

Apparently they can also run DeepSeek R1, and they have benchmarks for that. New models only require a couple of new masks, so they're flexible.

fennecbutt · 20 days ago
The counting-r's-in-strawberry problem was an example of people not understanding how the models work, but I guess it's good for showing the limitations of current architectures.

But the thing is, those architectures haven't improved a whole lot. Now when a model answers that correctly, it's either in the training data or by virtue of "count letters" or code-sandbox tools.

simlevesque · 21 days ago
LLMs can't count. They need tool use to answer these questions accurately.

jjcm · 21 days ago
A lot of naysayers in the comments, but there are so many uses for non-frontier models. The proof of this is in the openrouter activity graph for llama 3.1: https://openrouter.ai/meta-llama/llama-3.1-8b-instruct/activ...

10b daily tokens growing at an average of 22% every week.

There are plenty of times I look to Groq for narrow-domain responses; these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of responses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.

jtr1 · 21 days ago
Maybe this is a naive question, but why wouldn't there be market for this even for frontier models? If Anthropic wanted to burn Opus 4.6 into a chip, wouldn't there theoretically be a price point where this would lower inference costs for them?
ethmarks · 21 days ago
Because we don't know if this would scale well to high-quality frontier models. If you need to manufacture dedicated hardware for each new model, that adds a lot of expense and causes a lot of e-waste once the next model releases. In contrast, even this current iteration seems like it would be fantastic for low-grade LLM work.

For example, searching a database of tens of millions of text files. Very little "intelligence" is required, but cost and speed are very important. If you want to know something specific on Wikipedia but don't want to figure out which article to search for, you can just have an LLM read the entire English Wikipedia (7,140,211 articles) and compile a report. Doing that would be prohibitively expensive and glacially slow with standard LLM providers, but Taalas could probably do it in a few minutes or even seconds, and it would probably be pretty cheap.

spot5010 · 21 days ago
These seem ideal for robotics applications, where there is a low-latency narrow use case path that these chips can serve, maybe locally.
freakynit · 21 days ago
Exactly. One easily relatable use case is structured content extraction and/or conversion of web page data to markdown. I used to use Groq for the same (gpt-oss-20b model), but even that felt slow when doing this task at scale.

LLMs have opened up a natural-language interface to machines. This chip makes it realtime. And that opens a lot of use cases.

redman25 · 21 days ago
Many older models are still better at "creative" tasks because new models have been benchmarking for code and reasoning. Pre-training is what gives a model its creativity and layering SFT and RL on top tends to remove some of it in order to have instruction following.
SkyPuncher · 21 days ago
I have such a deep need for something that's just a step above semantic search. These non-frontier models running blazingly fast can solve that.

So many problems simply don't require a full LLM, but more than traditional software. Training a novel model isn't really a compelling argument at most tech startups right now, so you need to find an LLM-native way to do things.

baalimago · 21 days ago
I've never gotten incorrect answers faster than this, wow!

Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of 8B size. I think the lower bound for real intelligence is around 80B params (but what do I know). Best of luck!

otabdeveloper4 · 21 days ago
Make it for Qwen 2.5 and I'd buy it.

You don't actually need "frontier models" for Real Work (c).

(Summarization, classification and the rest of the usual NLP suspects.)

SkyPuncher · 21 days ago
I completely agree. So many things can benefit from having "smart classifiers".

Like, give me semantic search that can detect the difference between SSL and TLS without needing to put a full LLM in the loop.

PlatoIsADisease · 21 days ago
As someone with a 3060, I can attest that there are really really good 7-9B models. I still use berkeley-nest/Starling-LM-7B-alpha and that model is a few years old.

If we are going for accuracy, the question should be asked multiple times on multiple models and see if there is agreement.

But I do think once you hit 80B, you can struggle to see the difference between SOTA.

That said, GPT4.5 was the GOAT. I can't imagine how expensive that one was to run.

Derbasti · 21 days ago
Amazing! It couldn't answer my question at all, but it couldn't answer it incredibly quickly!

Snarky, but true. It is truly astounding, and feels categorically different. But it's also perfectly useless at the moment. A digital fidget spinner.

anthonypasq · 21 days ago
does no one understand what a tech demo is anymore? do you think this piece of technology is just going to be frozen in time at this capability for eternity?

do you have the foresight of a nematode?

edot · 21 days ago
Yeah, two p’s in the word pepperoni …
aurareturn · 22 days ago
Edit: it seems like this is likely one chip and not 10. I assumed an 8B 16-bit quant with 4K or more context. That made me think they must have chained multiple chips together, since an N6 850mm² chip would only yield ~3GB of SRAM max. Instead, they seem to have etched Llama 8B q3 with 1K context, which would indeed fit the chip size.

This requires 10 chips for an 8-billion q3-param model. 2.4kW.

10 reticle-sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.

Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Interesting design for niche applications.

What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?
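The fit claimed in the edit above checks out as quick arithmetic (assuming weights dominate the on-die storage):

```python
# Back-of-envelope check that an 8B-parameter model at q3 fits the ~3 GB
# SRAM/ROM ceiling of a reticle-sized N6 die mentioned above.
params = 8e9
bits_per_weight = 3
weight_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
print(weight_gb)  # 3.0 GB of weight storage, right at the claimed limit
```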

teaearlgraycold · 22 days ago
I'm thinking the best end result would come from custom-built models. An 8 billion parameter generalized model will run really quickly while not being particularly good at anything. But the same parameter count dedicated to parsing emails, RAG summarization, or some other specialized task could be more than good enough while also running at crazy speeds.
thrance · 22 days ago
> What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

Video game NPCs?

aurareturn · 22 days ago
Doesn’t pass the high-value or tremendous-speed tests.
danpalmer · 22 days ago
Alternatively, you could run far more RAG and thinking to integrate recent knowledge, I would imagine models designed for this putting less emphasis on world knowledge and more on agentic search.
freeone3000 · 21 days ago
Maybe; models with more embedded associations are also better at search. (Intuitively, this tracks: a model with no world knowledge, i.e. a pure Markov model, has no awareness of synonyms or relations, so the more knowledge a model has, the better it can search.) It’s not clear if it’s possible to build such a model, since there doesn’t seem to be a scaling cliff.
pjc50 · 22 days ago
Where are those numbers from? It's not immediately clear to me that you can distribute one model across chips with this design.

> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.

aurareturn · 22 days ago

  > The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

mike_hearn · 21 days ago
Well they claim two month turnaround. Big If True. How does the six months break down in your estimation? Maybe they have found a way to reduce the turnaround time.
adityashankar · 22 days ago
This depends on how much better the models get from here on. If Claude Opus 4.6 were transformed into one of these chips and ran at a hypothetical 17k tokens/second, I'm sure that would be astounding. It also depends on how much better Claude Opus 5 is compared to the current generation.
empath75 · 21 days ago
100x of a less good model might be better than 1 of a better model for many many applications.

This isn't ready for phones yet, but think of something like phones, where people buy new ones every 3 years and even having a mediocre on-device model at that speed would be incredible for something like Siri.

tyushk · 21 days ago
Data tagging? 20k tok/s is at the point where I'd consider running an LLM on data from a column of a database, and these <=100 token problems provide the least chance of hallucination or stupidity.
machiaweliczny · 21 days ago
A lot of NLP tasks could benefit from this
Shaanveer · 22 days ago
ceo
charcircuit · 22 days ago
No one would ever give such a weak model that much power over a company.

Deleted Comment

ttul · 20 days ago
The NextPlatform article hints at their approach:

“We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a SINGLE TRANSISTOR. So the density is basically insane. And this is not nuclear physics – it is fully digital. It is just a clever trick that we don’t want to broadcast. But once you hardwire everything, you get this opportunity to stuff very differently than if you have to deal with changing things. The important thing is that we can put a weight and do the multiply associated with it all in one transistor. And you know the multipliers are kind of the big boy piece of the computer.“

One transistor doing 4-bit multiplication? A plausible way to get “4-bit weight plus multiply in one transistor” in a 6 nm FinFET mask-ROM fabric is to make the ROM cell a single device whose drive strength is the stored value. At tapeout you pick one of about 16 discrete strengths (for example by choosing fin count and possibly Vt), so that transistor itself encodes a 4-bit weight. Then you do the multiply in the charge/time domain by encoding the input activation as a discrete pulse width or pulse count and letting the cell source or sink a weight-proportional current onto a precharged bitline for that duration. The resulting bitline voltage change (or time-to-threshold) is proportional to current times time, so it behaves like weight times input and can be accumulated along a column before a simple comparator or time-to-digital readout. It’s “digital” in the sense that both weight and input are quantized, but it relies on device physics; the hard parts are keeping 16 levels separable across PVT, mismatch, and aging, plus managing bitline noise and coupling and ensuring the device stays in a predictable operating region.

VLSI design produces digital outputs, but in the quantum silicon domain, it’s all about the analog…
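A toy numerical model of the charge/time-domain multiply hypothesized above (purely an illustration of the arithmetic, not the device physics, and note the later follow-up walks the analog interpretation back): a cell's drive current encodes a 4-bit weight fixed at tapeout, the activation is encoded as a pulse width, and the charge delivered to a bitline is current × time, which behaves like weight × activation.

```python
# Toy model of the hypothesized charge/time-domain MAC. Units are
# arbitrary: current is taken proportional to the 4-bit weight, pulse
# width proportional to the activation, so charge Q = I * t ~ w * x.

def cell_charge(weight, activation):
    """Charge sourced by one ROM cell: Q = I(weight) * T(activation)."""
    assert 0 <= weight < 16, "weight is one of 16 drive strengths"
    return weight * activation

def bitline_mac(weights, activations):
    """Charge accumulated along one bitline: a dot product."""
    return sum(cell_charge(w, x) for w, x in zip(weights, activations))

print(bitline_mac([3, 15, 7], [2, 1, 4]))  # 3*2 + 15*1 + 7*4 = 49
```

In hardware the accumulation would be a shared bitline and the readout a comparator or time-to-digital converter; the Python just shows why weight-proportional current times activation-proportional time yields a multiply-accumulate.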

ttul · 15 days ago
I stand corrected here. The design is "fully digital". They are not using analog trickery here. They are most likely just using a clever collection of very tiny transistors hooked up to two wire layers such that one _digital_ transistor can trigger the correct 4-bit output. This can be accomplished through clever logic design.
Barbing · 19 days ago
TIL your salary

(Kiddin’, my silly way to say thanks for a deeply technical look, helps me understand the kind of knowledge work that might be useful n years from now!)

mncharity · 21 days ago
There's an old idea of adaptive media. Imagine a video drama that's composed of a graph of clips, like an old "choose your own adventure" book ("Do you X? If yes, goto page 45"). With gaze tracking, one can "hmm, the viewer is more focused on character A than B... so we'll give clips and subplots with more A".

Now, when reading, the eye moves in little jumps: saccades. They last tens of ms, the eye is blind during them, and with high-quality tracking you know quite early just where that foveal peephole is going to land. So handwave a budget of a few ms for trajectory analysis, a few for 200 Hz rendering latency, and you still have 10-ish ms to play with. At 20k tok/s, that's 200 tokens.

So perhaps one might JIT the next sentence, or the topic of the next paragraph, or the entire nature of the document, based on the user's attention. Imagine a universal document - you start reading, and you find the document is about, whatever you wanted it to be about?
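The latency budget above, as rough arithmetic (every number is the comment's ballpark assumption, not a measurement):

```python
# Per-saccade budget for JIT text generation, using the comment's
# ballpark figures.
saccade_ms = 18.0      # saccades last tens of ms; pick a short one
trajectory_ms = 3.0    # predicting where the eye will land
render_ms = 5.0        # one frame of latency at 200 Hz
budget_ms = saccade_ms - trajectory_ms - render_ms  # 10-ish ms left

tok_per_s = 20_000
tokens = budget_ms / 1000 * tok_per_s
print(tokens)  # 200.0 tokens generatable while the eye is in flight
```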

awwaiid · 21 days ago
Generative TikTok for words
mncharity · 18 days ago
Hmm... TikTok has apparently long had "text enhanced with background" genres, and TIL, text posts since 2023. So text is ok. But non-independent items? For generative storytelling, "here is a next paragraph for the story", swipe left/right might work? Want to avoid "I don't much like this new paragraph, but I'm afraid to lose it and be stuck with something worse". Swipe left/right and up for continue? Swipe down to revisit old choices? Maybe present new text bolded, appended to old text, for context. Or a "next page of a picture book" idiom. A text field for direct creative or editorial intervention - speech to text. Maybe a side channel input for "story and background should now be soporific". Generative bedtime stories, but incrementally collaboratively created... Thanks for the brainstorming prompt.