Correct me if I am wrong, but by the looks of things on that chart the reduction in token use and the better score are all related to the fact that this method used 512 samples.... This doesn't seem to be of any use for local running agents or anything that has severe vram restrictions such as local models that people can run at home. So this would only benefit enterprise level systems no?