philipkiely commented on · Posted by u/philipkiely
philipkiely · 8 days ago
The Information link, for those with a subscription: https://www.theinformation.com/articles/inference-provider-b...
philipkiely commented on Why Fei-Fei Li and Yann LeCun are both betting on "world models"   entropytown.com/articles/... · Posted by u/signa11
nmfisher · a month ago
I wasn't actually able to use it because the servers were overloaded. What exactly impressed you (or, more generally, what does it actually let you do at the moment)?
philipkiely · a month ago
You give it a text prompt and optional image.

What you get back is a 3D room based on the prompt/image (it first rewrites your prompt into a specific format). Overall, the rooms tend to be detailed and imaginative.

Then you can fly around the room like in Minecraft creative mode. Really looking forward to more editing features/infill to augment this.

philipkiely commented on Why Fei-Fei Li and Yann LeCun are both betting on "world models"   entropytown.com/articles/... · Posted by u/signa11
philipkiely · a month ago
I played with Marble yesterday, Fei-Fei/World Labs' new product.

It is the most impressed I've been with an AI experience since the first time I saw a model one-shot a material piece of code.

Sure, it's an early product. The visual output reminds me a lot of early SDXL. But just look at what's happened to video in the last year and to images in the last three. The same thing is going to happen here, and fast, and I see the vision for generative worlds in everything from gaming/media to education to RL/simulation.

philipkiely commented on Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs   baseten.co/blog/sota-perf... · Posted by u/philipkiely
smcleod · 4 months ago
TensorRT-LLM is a right nightmare to set up and maintain. Good on them for getting it to work for them, but it's not for everyone.
philipkiely · 4 months ago
We have built a ton of tooling on top of TRT-LLM and use it not just for LLMs but also for TTS models (Orpheus), STT models (Whisper), and embedding models.
philipkiely commented on Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs   baseten.co/blog/sota-perf... · Posted by u/philipkiely
nektro · 4 months ago
> we were the clear leader running on NVIDIA GPUs for both latency and throughput per public data from real-world use on OpenRouter.

Baseten: 592.6 tps
Groq: 784.6 tps
Cerebras: 4,245 tps

still impressive work

philipkiely · 4 months ago
Yeah, the custom hardware providers are super good at TPS. Kudos to their teams for sure, and the demos of instant reasoning are incredibly impressive.

That said, we are serving the model at its full 131K-token context window, while they are serving 33K max, which could matter for some edge-case prompts.

Additionally, NVIDIA hardware is much more widely available if you are scaling a high-traffic application.
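If you want to compare these tradeoffs on the same prompt yourself, OpenRouter supports per-request provider routing. A rough sketch, assuming the openai/gpt-oss-120b model slug and "Baseten" provider name (verify both against OpenRouter's listings):

```python
import requests

# Sketch: pin a request to a single provider via OpenRouter's
# provider-routing options, so you know which backend (and thus
# which max context window) served it. The model slug and provider
# name below are assumptions; check OpenRouter's docs for the
# exact identifiers.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_API_KEY"},
    json={
        "model": "openai/gpt-oss-120b",
        "provider": {"order": ["Baseten"], "allow_fallbacks": False},
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```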

philipkiely commented on Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs   baseten.co/blog/sota-perf... · Posted by u/philipkiely
lagrange77 · 4 months ago
While you're here..

Do you guys know a website that clearly shows which open-source LLMs run on / fit into a specific GPU (or multi-GPU setup)?

The best heuristic I could find for the necessary VRAM is Number of Parameters × (Precision in bits / 8) × 1.2, from here [0].

[0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...

philipkiely · 4 months ago
Yeah, we have tried to build calculators before; it just depends on so much.

Your equation is roughly correct, but I tend to multiply by a factor of 2, not 1.2, to allow for highly concurrent traffic.
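To make that concrete, here is a minimal back-of-envelope sketch in Python combining the heuristic from [0] with the 2x concurrency headroom mentioned above (the example model size and precision are illustrative):

```python
def estimate_vram_gb(params_billions: float, precision_bits: int,
                     overhead: float = 2.0) -> float:
    """Rough VRAM estimate: weight memory plus a multiplier for
    KV cache, activations, and concurrent requests.

    overhead=1.2 is the low end from the linked guide [0];
    overhead=2.0 leaves headroom for highly concurrent traffic.
    """
    weights_gb = params_billions * (precision_bits / 8)  # GB for weights alone
    return weights_gb * overhead

# Example: a 120B-parameter model served at FP8
print(estimate_vram_gb(120, 8, overhead=1.2))  # ~144 GB
print(estimate_vram_gb(120, 8, overhead=2.0))  # ~240 GB
```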

u/philipkiely

Karma: 1112 · Cake day: August 20, 2018
About
DevRel @ https://baseten.co

Email me: username at baseten.co
