There has been some good research published on how RLHF, i.e. aligning to human preferences, easily introduces mode collapse and bias into models. For example, with a prompt like "Choose a random number", a base pretrained model gives relatively varied answers, but after fine-tuning to produce responses humans prefer, it becomes heavily biased towards answering with numbers like "7" or "42".
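If you want to see the effect yourself, here's a rough sketch of the kind of probe that research uses: sample the same prompt many times from a base model and from a preference-tuned one and compare the histograms of answers. The model names below are just placeholders (the tuned checkpoint name is made up), so swap in whichever pair you actually want to compare.

```python
import re
from collections import Counter

from transformers import pipeline

# Placeholder model names: substitute the base / preference-tuned pair you care about.
BASE_MODEL = "gpt2"
TUNED_MODEL = "some-org/gpt2-rlhf"  # hypothetical RLHF-tuned checkpoint

PROMPT = "Choose a random number between 1 and 100. Answer with just the number: "

def number_histogram(model_name: str, n_samples: int = 200) -> Counter:
    """Sample the prompt repeatedly and tally the first number in each completion."""
    generator = pipeline("text-generation", model=model_name)
    counts = Counter()
    for _ in range(n_samples):
        completion = generator(
            PROMPT,
            max_new_tokens=5,
            do_sample=True,        # sample, don't greedy-decode, or every run is identical
            temperature=1.0,
            return_full_text=False,
        )[0]["generated_text"]
        match = re.search(r"\d+", completion)  # grab the first number in the completion
        if match:
            counts[match.group()] += 1
    return counts

if __name__ == "__main__":
    for name in (BASE_MODEL, TUNED_MODEL):
        print(name, number_histogram(name).most_common(10))
```

A flat-ish histogram for the base model versus a big spike on one or two numbers for the tuned one is the mode collapse the papers describe.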
I didn't see it anywhere obvious on your GitHub; do you have any plans to open-source the training code?