To help those who got a bit confused (like me): this is Groq, the company making accelerators designed specifically for LLMs, which they call LPUs (Language Processing Units) [0]. So they want to sell you their custom machines that, while expensive, will be much more efficient at running LLMs for you. Meanwhile there is also Grok [1], which is xAI's series of LLMs and competes with ChatGPT and other models like Claude and DeepSeek.
EDIT - Seems that Groq has stopped selling their chips and now will only partner to fund large build-outs of their cloud [2].
[0] https://groq.com/the-groq-lpu-explained/
[1] https://grok.com/
[2] https://www.eetimes.com/groq-ceo-we-no-longer-sell-hardware
I deeply crave prosumer hardware that can sit on my shelf and handle massive models, like 200-400B at a reasonable quant. Something like Groq or Digits but at the cost of a high-end gaming PC, like $3k. This has to be a massive market, considering that even ancient Pascal-series GPUs that were once $50 are going for $500.
I have that irresistible urge too, but I have to keep reminding myself that I could spend $2,000 in credits over the course of a year and get the performance and utility of a $40k server, with scalable capacity, and without any risk that the investment will be obsolete when Llama 5 comes out.
The Framework Desktop is one option that isn't absurdly expensive. The memory bandwidth isn't great (200-something GB/s), but anything faster with that much memory at least doubles the price (e.g. a Mac Studio, and only the highest-tier M chips have faster memory).
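A rough back-of-envelope (sketched in Python) for why that bandwidth figure is the bottleneck for local decoding; the model sizes, quantization, and the ~800 GB/s Mac-Studio-class figure are illustrative assumptions, and real throughput also depends on batching, MoE routing, and kernel efficiency:

```python
# Rough decode-speed estimate: tokens/s ~= memory bandwidth / bytes of weights read per token.
# The sizes and quantization below are illustrative assumptions, not benchmarks.
def est_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    model_bytes_gb = active_params_b * bytes_per_param  # GB of weights touched per generated token
    return bandwidth_gb_s / model_bytes_gb

print(est_tokens_per_sec(256, 70, 0.5))  # ~7 tok/s: 70B dense model at 4-bit on ~256 GB/s
print(est_tokens_per_sec(800, 70, 0.5))  # ~23 tok/s: same model on ~800 GB/s (top M-series class)
```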
hi! i work @ groq and just made an account here to answer any questions for anyone who might be confused. groq has been around since 2016 and although we do offer hardware for enterprises in the form of dedicated instances, our goal is to make the models that we host easily accessible via groqcloud and groq api (openai compatible) so you can instantly get access to fast inference. :)
we have a pretty generous free tier and a dev tier you can upgrade to for higher rate limits. also, we deeply value privacy and don't retain your data. you can read more about that here: https://groq.com/privacy-policy/
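For anyone who wants to try it quickly, here is a minimal sketch of hitting the OpenAI-compatible endpoint with the standard openai Python client; the base URL matches the one quoted in an error message later in this thread, and the model name is one mentioned below and may be renamed over time:

```python
# Minimal sketch: Groq's API is OpenAI-compatible, so the openai client works
# by pointing base_url at Groq. The model name is an example and may change.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],          # key from the GroqCloud console
    base_url="https://api.groq.com/openai/v1",   # OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "In one sentence, what is an LPU?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```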
Scout claims a 10 million input token length but the available providers currently seem to limit to 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full sized 10 million token window running?
Maverick claims 1 million; Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.
I'd pump the brakes a bit on the 10M context expectations. It's just another linear attention mechanism with RoPE scaling [1]. They're doing something similar to what Cohere did recently, using a global attention mask and a local chunked attention mask.
Notably, the max sequence length in training was 256k, but the native short context is still just 8k. I'd expect the retrieval performance to be all over the place here. Looking forward to seeing some topic modeling benchmarks run against it (I'll be doing so with some of my local/private datasets).
EDIT: to be fair/complete, I should note that they do claim perfect NIAH (needle-in-a-haystack) text retrieval performance across all 10M tokens for the Scout model in their blog post: https://ai.meta.com/blog/llama-4-multimodal-intelligence/. There are some serious limitations and caveats to that particular flavor of test, though.
[1] https://github.com/meta-llama/llama-models/blob/eececc27d275...
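To make the "global mask + chunked local mask" description above concrete, here is a toy numpy illustration of the two mask shapes; this shows the general idea only, not Meta's actual implementation, and the chunk size is arbitrary:

```python
# Toy illustration of the two attention patterns described above (not Meta's code).
# Global layers: a standard causal mask. Local layers: causal attention restricted to
# fixed-size chunks, so each token only attends to earlier tokens in its own chunk.
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    return np.tril(np.ones((n, n), dtype=bool))

def chunked_causal_mask(n: int, chunk: int) -> np.ndarray:
    same_chunk = (np.arange(n)[:, None] // chunk) == (np.arange(n)[None, :] // chunk)
    return causal_mask(n) & same_chunk

n, chunk = 8, 4  # tiny sizes just to print the pattern
print(causal_mask(n).astype(int))
print(chunked_causal_mask(n, chunk).astype(int))
```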
I might be biased by the products I'm building, but it feels to me that function calling support is table stakes now? Are open-source models just missing the dataset needed to fine-tune one?
Very few of the models supported on Groq/Together/Fireworks support function calling, and rarely the interesting ones (DeepSeek V3, the large Llamas, etc.).
100%. we've found that llama-3.3-70b-versatile and qwen-qwq-32b perform exceptionally well with reliable function calling. we had recognized the need for this and our engineers partnered with glaive ai to create fine tunes of llama 3.0 specifically for better function calling performance until the llama 3.3 models came along and performed even better.
i'd actually love to hear your experience with llama scout and maverick for function calling. i'm going to dig into it with our resident function calling expert rick lamers this week.
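For reference, a hedged sketch of what standard function calling looks like against an OpenAI-compatible endpoint like Groq's, using the usual tools schema; the get_weather function is a made-up example, and the model name is the one mentioned above:

```python
# Sketch of OpenAI-style tool calling against an OpenAI-compatible endpoint.
# get_weather is a hypothetical example tool, not a real API.
import json, os
from openai import OpenAI

client = OpenAI(api_key=os.environ["GROQ_API_KEY"],
                base_url="https://api.groq.com/openai/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```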
Thank you for saying this out loud. I've been losing my mind wondering where the discussion on this was. LLMs without Tool Use/Function Calling is basically a non starter for anything I want to do.
When I was working with LLMs without function calling I made the scaffold put some information in the system prompt that tells it some JSON-ish syntax it can use to invoke function calls.
It places more of a "mental burden" on the model to output tool calls in your custom format, but it worked enough to be useful.
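A minimal sketch of that kind of scaffold, assuming a plain chat-completion API with no native tool support; the tag format and the lookup_order tool are made up for illustration:

```python
# Sketch of prompt-based tool calling for models without native function calling.
# The model is told to emit a JSON-ish block; the scaffold parses and executes it.
import json, re

SYSTEM_PROMPT = """You can call tools by replying with exactly one block like:
<tool>{"name": "lookup_order", "arguments": {"order_id": "123"}}</tool>
If no tool is needed, answer normally."""

TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},  # stub tool
}

def maybe_run_tool(model_output: str):
    """Return the tool result if the model emitted a tool block, else None."""
    match = re.search(r"<tool>(.*?)</tool>", model_output, re.DOTALL)
    if not match:
        return None
    call = json.loads(match.group(1))
    return TOOLS[call["name"]](**call["arguments"])

# Example: pretend this string came back from the model.
print(maybe_run_tool('<tool>{"name": "lookup_order", "arguments": {"order_id": "123"}}</tool>'))
```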
Although Llama 4 is too big for mere mortals to run without many caveats, the economics of calling a dedicated-hosted Llama 4 are more interesting than expected.
$0.11 per 1M tokens, a 10 million token context window (not yet implemented in Groq), and faster inference due to fewer activated parameters allow for some specific applications that were not cost-feasible with GPT-4o/Claude 3.7 Sonnet. That's all dependent on whether the quality of Llama 4 is as advertised, of course, particularly around that 10M context window.
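A quick back-of-envelope on that pricing; only the $0.11/M figure comes from the comment above, while the GPT-4o and Claude 3.7 Sonnet input prices are assumed rough list prices at the time and may be off:

```python
# Back-of-envelope input-token cost comparison. The non-Llama prices are assumptions.
PRICE_PER_M_INPUT = {
    "llama-4 (hosted)": 0.11,
    "gpt-4o (assumed)": 2.50,
    "claude-3.7-sonnet (assumed)": 3.00,
}

tokens = 100_000_000  # e.g. classifying/summarizing 100M tokens of documents
for name, price in PRICE_PER_M_INPUT.items():
    print(f"{name}: ${tokens / 1_000_000 * price:,.2f}")
# llama-4: $11.00, gpt-4o: $250.00, claude-3.7-sonnet: $300.00
```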
It's possible that we'll see smaller Llama 4-based models in the future, though. Similar to Llama 3.2 1B, which was released later than other Llama 3.x models.
Just tried this, thank you. Couple of questions: it looked like just Scout access for now, do you have plans for larger model access? Also, it seems like context length is always fairly short with you guys; is that an architectural or a cost-based decision?
amazing! and yes, we'll have maverick available today. the reason we limit ctx window is because demand > capacity. we're pretty busy with building out more capacity so we can get to a state where we give everyone access to larger context windows without melting our currently available lpus, haha.
Cool. I would so happily pay you guys for a long-context API that aider could point at -- the speed is just game changing. I know your architecture is different, so I understand it's an engineering lift. But I bet you'd find some Pareto-optimal point in the curve where you could charge a lot more for the speed you guys can do, if the context is long enough for coding.
I got an error when passing a prompt with about 20k tokens to the Llama 4 Scout model on groq (despite Llama 4 supporting up to 10M token context). groq responds with a POST https://api.groq.com/openai/v1/chat/completions 413 (Payload Too Large) error.
Is there some technical limitation on the context window size with LPUs or is this a temporary stop-gap measure to avoid overloading groq's resources? Or something else?
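In case it helps anyone hitting the same thing, a hedged sketch of catching that 413 and retrying with a shorter prompt via the openai client; the Groq model id is my assumption of how they list Scout, and the character-based truncation is a crude stand-in for proper token counting:

```python
# Crude fallback for a 413 (Payload Too Large) response: retry with a shorter prompt.
# The model id is an assumption; character truncation only approximates token limits.
import os
import openai

client = openai.OpenAI(api_key=os.environ["GROQ_API_KEY"],
                       base_url="https://api.groq.com/openai/v1")

def complete_with_truncation(prompt: str, model: str = "meta-llama/llama-4-scout-17b-16e-instruct"):
    while True:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except openai.APIStatusError as err:
            if err.status_code != 413 or len(prompt) < 1000:
                raise
            prompt = prompt[: len(prompt) // 2]  # halve the prompt and retry
```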
Seems to be about 500 tk/s. That's actually significantly less than I expected / hoped for, but fantastic compared to nearly anything else. (specdec when?)
Out of curiosity, the console is letting me set max output tokens to 131k but errors above 8192. what's the max intended to be? (8192 max output tokens would be rough after getting spoiled with 128K output of Claude 3.7 Sonnet and 64K of gemini models.)
do you happen to be trying this out on free tier right now? because our rate limits are at 6k tokens per minute on free tier for this model, which might be what you're running into.
When I tried Llama 4 Scout and tried to set the max output tokens above 8192, it told me the max was 8192. Once I set it below that, it worked. This was in the console.
It's not - it's absolutely a vanishingly small market.
So, an Apple Mac Studio?
https://www.nvidia.com/en-us/products/workstations/dgx-spark...
They stopped selling the hardware to the public, and it takes an extraordinary amount of it to run these larger models due to the limited RAM on each chip.
All three of those can also be accessed via OpenRouter - with both a chat interface and an API:
- Scout: https://openrouter.ai/meta-llama/llama-4-scout
- Maverick: https://openrouter.ai/meta-llama/llama-4-maverick
Would you mind expanding on this? Or point to a reference or two? Thanks! I am trying to understand it.
The AMD MI300X has day-zero support for running it using vLLM. Easy enough to rent them at decent pricing.
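For anyone going that route, a hedged sketch of loading it through vLLM's Python API (vLLM also ships ROCm builds for MI300X); the Hugging Face model id is my assumption of the Scout instruct checkpoint name, and tensor_parallel_size depends on how many GPUs you rent:

```python
# Sketch of serving a Llama 4 checkpoint with vLLM. The model id is an assumed HF name;
# check the actual hub listing and adjust parallelism/context to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF id
    tensor_parallel_size=8,                             # spread weights across 8 GPUs
    max_model_len=131072,                               # keep KV-cache memory manageable
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the Llama 4 release in two sentences."], params)
print(outputs[0].outputs[0].text)
```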