philipkiely commented on · Posted by u/philipkiely
philipkiely · 8 days ago
The Information link, for those with a subscription: https://www.theinformation.com/articles/inference-provider-b...
philipkiely commented on Why Fei-Fei Li and Yann LeCun are both betting on "world models"   entropytown.com/articles/... · Posted by u/signa11
nmfisher · a month ago
I wasn't actually able to use it because the servers were overloaded. What exactly impressed you (or, more generally, what does it actually let you do at the moment)?
philipkiely · a month ago
You give it a text prompt and optional image.

What you get back is a 3D room based on the prompt/image (it first rewrites your prompt into a specific format). Overall, the rooms tend to be detailed and imaginative.

Then you can fly around the room like in Minecraft creative mode. Really looking forward to more editing features/infill to augment this.

philipkiely commented on Why Fei-Fei Li and Yann LeCun are both betting on "world models"   entropytown.com/articles/... · Posted by u/signa11
philipkiely · a month ago
I played with Marble yesterday, Fei-Fei/World Labs' new product.

It is the most impressed I've been with an AI experience since the first time I saw a model one-shot a material piece of code.

Sure, it's an early product. The visual output reminds me a lot of early SDXL. But just look at what's happened to video in the last year and to images in the last three. The same thing is going to happen here, and fast, and I see the vision for generative worlds in everything from gaming/media to education to RL/simulation.

philipkiely commented on Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs   baseten.co/blog/sota-perf... · Posted by u/philipkiely
smcleod · 4 months ago
TensorRT-LLM is a right nightmare to set up and maintain. Good on them for getting it to work for them, but it's not for everyone.
philipkiely · 4 months ago
We have built a ton of tooling on top of TRT-LLM and use it not just for LLMs but also for TTS models (Orpheus), STT models (Whisper), and embedding models.
philipkiely commented on Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs   baseten.co/blog/sota-perf... · Posted by u/philipkiely
nektro · 4 months ago
> we were the clear leader running on NVIDIA GPUs for both latency and throughput per public data from real-world use on OpenRouter.

Baseten: 592.6 tps
Groq: 784.6 tps
Cerebras: 4,245 tps

still impressive work

philipkiely · 4 months ago
Yeah, the custom hardware providers are super good at TPS. Kudos to their teams for sure, and the demos of instant reasoning are incredibly impressive.

That said, we are serving the model at its full 131K-token context window, while they are serving 33K max, which could matter for some edge-case prompts.

Additionally, NVIDIA hardware is much more widely available if you are scaling a high-traffic application.
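If you want to compare these tradeoffs on the same prompt yourself, OpenRouter supports per-request provider routing. A rough sketch, assuming the openai/gpt-oss-120b model slug and "Baseten" provider name (verify both against OpenRouter's listings):

```python
import requests

# Sketch: pin a request to a single provider via OpenRouter's
# provider-routing options, so you know which backend (and thus
# which max context window) served it. The model slug and provider
# name below are assumptions; check OpenRouter's docs for the
# exact identifiers.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_API_KEY"},
    json={
        "model": "openai/gpt-oss-120b",
        "provider": {"order": ["Baseten"], "allow_fallbacks": False},
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```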

philipkiely commented on Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs   baseten.co/blog/sota-perf... · Posted by u/philipkiely
lagrange77 · 4 months ago
While you're here..

Do you guys know a website that clearly shows which open-source LLMs run on / fit into a specific GPU (or multi-GPU setup)?

The best heuristic I could find for the necessary VRAM is Number of Parameters × (Precision in bits / 8) × 1.2, from here [0].

[0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...

philipkiely · 4 months ago
Yeah, we have tried to build calculators before; it just depends on so much.

Your equation is roughly correct, but I tend to multiply by a factor of 2, not 1.2, to allow for highly concurrent traffic.
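To make that concrete, here is a minimal back-of-envelope sketch in Python combining the heuristic from [0] with the 2x concurrency headroom mentioned above (the example model size and precision are illustrative):

```python
def estimate_vram_gb(params_billions: float, precision_bits: int,
                     overhead: float = 2.0) -> float:
    """Rough VRAM estimate: weight memory plus a multiplier for
    KV cache, activations, and concurrent requests.

    overhead=1.2 is the low end from the linked guide [0];
    overhead=2.0 leaves headroom for highly concurrent traffic.
    """
    weights_gb = params_billions * (precision_bits / 8)  # GB for weights alone
    return weights_gb * overhead

# Example: a 120B-parameter model served at FP8
print(estimate_vram_gb(120, 8, overhead=1.2))  # ~144 GB
print(estimate_vram_gb(120, 8, overhead=2.0))  # ~240 GB
```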

u/philipkiely

Karma: 1112 · Cake day: August 20, 2018
About
DevRel @ https://baseten.co

Email me: username at baseten.co
