I’m honestly impressed that a pure python implementation can beat out vLLM and SGLang. Granted they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Though dynamic shapes have still been a huge thorn in my side, I’ll need to look closer at how they pulled it off…
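For concreteness, the thing that keeps biting me is recompiles when batch and sequence dims vary between calls. A minimal sketch of the workaround I usually reach for (the model and shapes are made up, nothing to do with Tokasaurus's code):

    # Toy dynamic-shapes example: mark the varying dims so torch.compile
    # traces with symbolic sizes instead of recompiling for every new
    # (batch, seqlen) combination it sees.
    import torch

    model = torch.nn.Linear(64, 64)
    compiled = torch.compile(model, dynamic=True)

    for bsz, seqlen in [(1, 128), (4, 512), (7, 300)]:
        x = torch.randn(bsz, seqlen, 64)
        # Hint that dims 0 and 1 are dynamic up front, rather than waiting
        # for torch.compile to discover it after a couple of recompiles.
        torch._dynamo.mark_dynamic(x, 0)
        torch._dynamo.mark_dynamic(x, 1)
        y = compiled(x)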
Hi! I work on dynamic shapes in pytorch and would love to hear more about the challenges you’ve run into. We’re always looking to improve the experience, so if you’re open to chatting, feel free to DM me on Twitter (@bobrenjc93) or email me at bobren@meta.com.
since you work on pytorch, what would you say is the best place to ask questions about general usage and troubleshooting? I've been struggling with what I would consider a simple torchrun elastic training example, and haven't found any good resources online. I've been spelunking through pytorch but have a feeling a little back and forth with someone familiar with these features would immensely clear things up.
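For reference, this is roughly the shape of what I've been trying to get working; the node counts, rendezvous endpoint, and checkpoint handling are placeholders, so treat it as a sketch rather than a known-good recipe:

    # train.py -- minimal DDP script I'm trying to launch elastically, e.g.:
    #   torchrun --nnodes=1:4 --nproc_per_node=2 --max_restarts=3 \
    #            --rdzv_backend=c10d --rdzv_id=demo \
    #            --rdzv_endpoint=host0:29400 train.py
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("gloo")   # "nccl" on GPU nodes
        rank = dist.get_rank()
        model = DDP(torch.nn.Linear(10, 10))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for step in range(100):
            opt.zero_grad()
            out = model(torch.randn(8, 10))
            out.sum().backward()
            opt.step()
            # Elastic restarts begin from scratch unless you checkpoint;
            # where/how to save and resume is exactly the part I'm unsure about.
            if rank == 0 and step % 10 == 0:
                torch.save(model.state_dict(), "ckpt.pt")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()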
I mean, vllm and sglang are both "pure python" essentially as well. But yeah, in ML you rarely require C++ to get good performance for most of the systems people are writing.
Yeah you're right (although they started to open source some of that recently, iirc). I meant SOTA for inference engines we can actually download and use ourselves.
While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.
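To make the tradeoff concrete: the basic trick async-TP exploits is overlapping communication with matmul compute. Here's a toy version using plain async collectives (not how Tokasaurus or PyTorch's async-TP is actually implemented), and it only helps when the matmuls are big enough to hide the all-gather:

    # Toy comm/compute overlap (illustrative only; assumes a process group is
    # already initialized): kick off an async all-gather of the activation
    # shard, do the local matmul while NVLink is busy, then finish the rest.
    import torch
    import torch.distributed as dist

    def allgather_matmul_overlapped(x_shard, weight, group=None):
        rank = dist.get_rank(group)
        world = dist.get_world_size(group)
        bufs = [torch.empty_like(x_shard) for _ in range(world)]
        handle = dist.all_gather(bufs, x_shard, group=group, async_op=True)

        outs = [None] * world
        outs[rank] = x_shard @ weight   # overlaps with the in-flight all-gather
        handle.wait()                   # real async-TP pipelines this per chunk
        for r in range(world):
            if outs[r] is None:
                outs[r] = bufs[r] @ weight
        return torch.cat(outs, dim=0)

With small batches the local matmul finishes long before the collective does, so you pay the extra CPU scheduling overhead and get nothing back, which is consistent with the 6k+ token crossover mentioned above.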
How big of a use case is synthetic data generation? I’m curious as I see a lot about it coming from academic projects but I haven’t seen much related to commercial use cases
Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.
I'm curious how big the latency tradeoff is.
I know the assumption here is that it doesn't matter for those use cases, but what order of magnitude is it? 10x? 100x?
this is important for usage in "soft realtime" applications, where you do not need an instant response but someone is still waiting.
if the latency hit is really big, then it can basically only be used for background processes.
Cool project! The codebase is simple and well documented, a good starting point for anyone interested in how to implement a high-performance inference engine. The prefix sharing is very relevant for anyone running batch inference to generate RL rollouts.
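If it helps anyone picture it, prefix sharing just means requests that start with the same prompt prefix reuse one KV-cache entry instead of each re-running prefill. A toy exact-match version (real engines, Tokasaurus included, work on token blocks / radix-tree-style structures, so this is only the flavor):

    # Toy prefix cache for batched rollouts: only the first request with a
    # given prompt pays for prefill; the other rollouts reuse the cached KV.
    from typing import Dict, Tuple
    import torch

    class ToyPrefixCache:
        def __init__(self):
            self._cache: Dict[Tuple[int, ...], torch.Tensor] = {}

        def get_or_prefill(self, prompt_tokens, prefill_fn):
            key = tuple(prompt_tokens)
            if key not in self._cache:
                # First request with this prefix runs prefill and caches it.
                self._cache[key] = prefill_fn(prompt_tokens)
            return self._cache[key]

    # e.g. 64 rollouts of the same prompt -> one prefill, 63 cache hits:
    # kv = cache.get_or_prefill(prompt_ids, lambda toks: model_prefill(toks))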
Given that chat and API use cases need low latency, llama.cpp is probably still the best choice for self-hosted models, with or without GPU support. And Ollama is the leader for wrapping llama.cpp.
Because Tokasaurus was mentioned as better than Ollama for conducting Darwinian Gödel Machine operations (self-improvement), I looked for the linked repo on GitHub and it was a 404. So glad it is back: https://github.com/ScalingIntelligence/tokasaurus.
Stanford was edgy enough to reefer to “toking” in the moniker, but exercised restraint by depicting the titular thunder lizard smoking a putatively conventional tobacco cigarette.
I am hoping to use this “Tokasaurus” nickname with affection for my neighbors, if Stanford is OK with informal usage.
Success with Meta AI / Llama 4:
Hey Meta, I would like to see an image of a Tyrannosaurus Rex, who is clad in a leather jacket, sunglasses, and fedora. He is so cool looking, and smoking a joint of marijuana, and his image is superimposed against a skyline of Phoenix in the golden glow of sunset.
Proof that attention is not only highly desired by Stanford tech bros, but also by HN keyboard warriors equipped with LLM tech. Everyone is clever all of the time.
https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...
Looks like they don't compare to TensorRT-LLM throughput numbers, which, last I checked, are SOTA in open source.
The generation benchmark was 5% faster than SGLang.
Also, this seems very useful for generating synthetic data or labelling a bunch of data. A 6k batch size is small for data labelling.
Success with Meta AI / Llama 4:
Hey Meta, I would like to see an image of a Tyrannosaurus Rex, who is clad in a leather jacket, sunglasses, and fedora. He is so cool looking, and smoking a joint of marijuana, and his image is superimposed against a skyline of Phoenix in the golden glow of sunset.
Can you light up the joint with a glowing tip?