How does the target model validate the draft tokens without running inference as normal?
Because if it is doing just that, I don't get the point: you can't trust the draft tokens before they're validated, so you're still stuck waiting for the target model.
I think this answer is as good as any of the human-generated ones in the thread so far, but the real power is that you can ask it follow-up questions. https://chatgpt.com/share/6894504f-4458-8008-a8c9-f371588259...
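The gist, as I understand it (a toy sketch, not taken from the linked answer, and all names here are hypothetical): the target model does not run K sequential inference steps to check K draft tokens. A transformer scores every position of a sequence in a single forward pass, so one pass over prefix+drafts yields the target's own next-token choice at each draft position, and drafts are accepted left to right until the first disagreement. A greedy-decoding toy in C, with a dummy stand-in for the model:

```c
#include <stdio.h>

/* Toy sketch, all names hypothetical. The point: a transformer scores
 * an entire sequence in ONE forward pass, emitting its preferred next
 * token at every position. So verifying K draft tokens costs one target
 * pass, not the K sequential passes generation would need. */

enum { PREFIX = 4, K = 4, LEN = PREFIX + K };

/* Stand-in for one (expensive) target-model forward pass: for each
 * position i, emit the target's choice for the token at position i+1. */
static void target_forward(const int *tokens, int len, int *preferred) {
    for (int i = 0; i < len; i++)
        preferred[i] = (tokens[i] * 31 + 7) % 100;   /* dummy "model" */
}

int main(void) {
    /* prompt so far, then K tokens guessed by the cheap draft model */
    int tokens[LEN] = {3, 14, 15, 92, /* drafts: */ 59, 36, 23, 99};
    int preferred[LEN];

    target_forward(tokens, LEN, preferred);   /* ONE pass scores all K */

    /* Accept drafts left to right while they match what the target
     * itself would have emitted; stop at the first mismatch. */
    int accepted = 0;
    while (accepted < K &&
           tokens[PREFIX + accepted] == preferred[PREFIX - 1 + accepted])
        accepted++;

    printf("accepted %d of %d drafts in one target pass\n", accepted, K);
    return 0;
}
```

The real scheme uses a rejection-sampling rule over probabilities rather than exact greedy matches, but the cost structure is the same: one verification pass costs about as much as one generation step, and even when every draft is rejected, the target's own next token falls out of that same pass, so the round is never wasted.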
The Box-Muller transform is slow because it requires log, sqrt, sin, and cos. Depending on your needs, you can approximate all of these.
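For reference, the textbook transform, so you can see where each of those calls lands:

```c
#include <math.h>

/* Standard Box-Muller: two uniforms u1, u2 in (0,1] -> two independent
 * standard normals. One log, one sqrt, one sin and one cos per pair;
 * these are exactly the calls being approximated below. */
static void box_muller(float u1, float u2, float *z0, float *z1) {
    float r = sqrtf(-2.0f * logf(u1));   /* radius: log + sqrt */
    float t = 6.28318531f * u2;          /* angle in [0, 2pi] */
    *z0 = r * cosf(t);
    *z1 = r * sinf(t);
}
```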
log2 can be easily approximated using fast inverse square root tricks, as in the sketch below (conveniently, this also negates the need to ensure your input is not zero):
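Something like this; it is the same bits-as-integer observation behind the Quake rsqrt magic constant, and the constants are the standard IEEE-754 ones, not anything from the parent's actual code:

```c
#include <stdint.h>
#include <string.h>

/* Approximate log2(x) by reading the float's bits as an integer:
 * IEEE-754 stores roughly 2^23 * (log2(x) + 127), so rescaling gives a
 * piecewise-linear log2 with max error ~0.09. An input of exactly 0
 * comes out as -127 instead of -inf, which is why the zero check can
 * be skipped. */
static inline float fast_log2(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* safe type pun */
    return (float)bits * (1.0f / (1 << 23)) - 127.0f;
}
```

Box-Muller wants the natural log, so multiply by ln 2 (~0.693147f) and fold that constant into the -2 factor.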
sqrt is pretty fast; turn `-ffast-math` on (this is already the default on GPUs). Remember that you're normalizing the resultant vector, so add this to the mag_sqr before square-rooting it.
The slow part of sin/cos is precise range reduction. We don't need that: the input to sin/cos in Box-Muller is by construction in the range [0, 2pi], so range reduction is a no-op.
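A hedged sketch of the kind of thing that suffices here; this particular fit is the well-known gamedev parabola approximation, not necessarily what the parent uses:

```c
#include <math.h>

/* Cheap sin for inputs already known to be in [0, 2pi]: shift to
 * [-pi, pi] with one subtract (no fmod, no Payne-Hanek reduction),
 * apply the classic parabola approximation plus one refinement step,
 * then undo the shift via sin(x) = -sin(x - pi). Max error ~1e-3. */
static inline float fast_sin_0_2pi(float x) {
    const float PI = 3.14159265f;
    float y = x - PI;                                    /* [-pi, pi] */
    float s = (4.0f / PI) * y - (4.0f / (PI * PI)) * y * fabsf(y);
    s = 0.225f * (s * fabsf(s) - s) + s;                 /* refine */
    return -s;
}
```

cos comes from the same helper via cos(x) = sin(x + pi/2), with one conditional subtract of 2pi to keep the argument in range.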
For my particular niche, these approximations and the resulting biases are justified; YMMV. When I last looked at it, the fast log2 gave a bunch of linearities where you wanted it to be smooth, but across multiple dimensions these linearities seemed to cancel out.
If you want fast sqrt (or more generally, if you care at all about not getting garbage), I would recommend using an explicit approx sqrt function in your programming language rather than turning on fastmath.
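E.g. something like this in C (the classic Quake-style rsqrt with one Newton step, one possible explicit approximation): the accuracy trade-off stays local and visible instead of being a global compiler flag.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x): bit-hack initial guess, then one
 * Newton-Raphson step (~0.2% max relative error for normal inputs). */
static inline float approx_rsqrt(float x) {
    uint32_t i;
    float half = 0.5f * x, y;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759df - (i >> 1);        /* magic initial guess */
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - half * y * y); /* one Newton step */
}

/* sqrt(x) = x * (1/sqrt(x)); also happens to return 0 for x == 0. */
static inline float approx_sqrt(float x) {
    return x * approx_rsqrt(x);
}
```

On hardware with a native reciprocal-sqrt estimate (SSE's rsqrtss, CUDA's rsqrtf) you'd use that instead of the bit hack.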