aurohacker commented on GLM-4.7: Advancing the Coding Capability (z.ai/blog/glm-4.7...) · Posted by u/pretext
mft_ · 3 months ago
I’m never clear, for these models with only a proportion of parameters active (32B here), to what extent this reduces the RAM a system needs, if at all?
aurohacker · 3 months ago
Great answers here: for MoE there’s a compute saving but no memory saving, even though the network is super-sparse. It turns out there’s a paper on predicting in advance which experts the next few layers will use: "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I’d love to know...
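For intuition, here’s a minimal sketch (hypothetical names and shapes, not from the paper) of why top-k routing cuts FLOPs per token while every expert’s weights stay resident in memory:

    # Minimal MoE sketch: memory scales with n_experts (all weights
    # allocated up front), compute scales with top_k (each token runs
    # through only its routed experts). Names/shapes are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8, top_k=2):
            super().__init__()
            # All expert weights live in memory regardless of routing.
            self.experts = nn.ModuleList(
                nn.Linear(d_model, d_model) for _ in range(n_experts))
            self.gate = nn.Linear(d_model, n_experts)
            self.top_k = top_k

        def forward(self, x):  # x: (tokens, d_model)
            scores = self.gate(x)
            w, idx = scores.topk(self.top_k, dim=-1)  # per-token routing
            w = F.softmax(w, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e
                    if mask.any():  # only routed tokens hit this expert
                        out[mask] += w[mask, k, None] * expert(x[mask])
            return out

The lookahead-gate idea, as I read it, is to predict idx for upcoming layers early enough to prefetch just those experts, which is what would turn the sparsity into an actual memory/transfer win on a single GPU.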
aurohacker commented on Meta Superintelligence Labs' first paper is about RAG (paddedinputs.substack.com...) · Posted by u/skadamat
aurohacker · 5 months ago
Figure 1 in the paper is all about the encoder and how the context and query are packaged and sent to the decoder. I wish it were more complete...
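For what it’s worth, a generic sketch of the usual packaging (my assumption of a typical layout, not the paper’s actual Figure 1 scheme): retrieved passages are tagged and concatenated ahead of the query before being handed to the decoder:

    # Generic RAG packaging sketch -- assumed layout, not the paper's.
    def build_rag_input(query: str, passages: list[str]) -> str:
        # Tag each retrieved passage so the decoder can attribute answers;
        # keep the query last, nearest the generation position.
        docs = "\n".join(f"[doc {i}] {p}" for i, p in enumerate(passages, 1))
        return f"{docs}\n\nQuestion: {query}\nAnswer:"

    print(build_rag_input("What does the encoder emit?",
                          ["Compressed context embeddings ..."]))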

u/aurohacker

Karma: 3 · Cake day: August 14, 2025
About
ml practitioner