I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
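For anyone unfamiliar with the first trick: speculative decoding only saves work when a small draft model can guess the big model's tokens, which is why it tops out well below these speeds. A minimal greedy sketch of the propose-and-verify loop (the model call signatures are my assumption, and real implementations accept/reject by sampling against both distributions rather than exact match):

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix, k=4):
    # One propose-and-verify round. draft_model / target_model are assumed
    # to be callables mapping token ids (1, seq) -> logits (1, seq, vocab).
    p = prefix.size(1)

    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = prefix
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) The big target model scores all k proposals in ONE forward pass.
    #    Logits at position i predict token i+1, so positions p-1..p+k-1 give
    #    the target's choice for each proposed slot plus one bonus token.
    target_preds = target_model(draft)[:, p - 1:].argmax(-1)  # (1, k+1)

    # 3) Keep the longest run of proposals the target agrees with, then emit
    #    the target's own token at the first disagreement (or the bonus
    #    token if everything was accepted).
    proposed = draft[:, p:]                                   # (1, k)
    accepted = 0
    while accepted < k and proposed[0, accepted] == target_preds[0, accepted]:
        accepted += 1
    return torch.cat(
        [prefix, proposed[:, :accepted], target_preds[:, accepted:accepted + 1]],
        dim=-1,
    )
```

The payoff is bounded by how often the draft model is right, so it buys a small constant factor at best, nothing like the order-of-magnitude speedups in question.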
Groq has paying customers below the enterprise level and actually makes all of its models broadly available to everyone, unlike Cerebras, which is very selective, so they have that going for them. But in terms of sheer speed, and on the largest models, Groq doesn't really compare.
Since I see DuckDB mentioned, folks wanting something serverless may also be interested in LanceDB, which is written in Rust with most features built out for Python.
https://github.com/lancedb/lancedb
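For anyone curious what the embedded usage looks like, a minimal sketch based on LanceDB's documented Python API (the path, table name, and toy vectors are made up):

```python
import lancedb

# "Serverless": the database is just a directory on local disk (or object storage).
db = lancedb.connect("./lancedb_demo")

# Create a table straight from Python dicts; the vector column is inferred.
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2, 0.3], "text": "hello world"},
        {"vector": [0.9, 0.8, 0.7], "text": "goodbye world"},
    ],
)

# Nearest-neighbor search, no server process required.
hits = table.search([0.1, 0.2, 0.25]).limit(1).to_list()
print(hits[0]["text"])  # -> "hello world"
```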
Side note: I wrote a proof of concept where embedding generation is handled inside PostgreSQL itself, independent of the index.
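Not the parent's actual PoC, just one way that idea can look: a PL/Python function that computes the vector inside Postgres, so generation is decoupled from whatever index (if any) sits on the column. The function name, model, and connection string are all illustrative, and it assumes the plpython3u and pgvector extensions plus sentence-transformers installed on the server:

```python
import psycopg  # psycopg 3; the connection string below is a placeholder

EMBED_FN = """
CREATE OR REPLACE FUNCTION embed(doc text) RETURNS vector AS $$
    # SD is PL/Python's per-session cache, so the model loads once per backend.
    if "model" not in SD:
        from sentence_transformers import SentenceTransformer
        SD["model"] = SentenceTransformer("all-MiniLM-L6-v2")
    # A Python list stringifies to '[...]', which pgvector's input parser accepts.
    return SD["model"].encode(doc).tolist()
$$ LANGUAGE plpython3u;
"""

with psycopg.connect("dbname=demo") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")      # pgvector
    conn.execute("CREATE EXTENSION IF NOT EXISTS plpython3u")
    conn.execute(EMBED_FN)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id serial PRIMARY KEY, body text, embedding vector(384))"
    )
    # Embedding generation happens server-side, independent of any index:
    conn.execute("INSERT INTO docs (body, embedding) VALUES (%s, embed(%s))",
                 ("hello world", "hello world"))
```

The nice property is that you can add, rebuild, or drop an HNSW/IVF index on the embedding column later without touching how the vectors get produced.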