From this it follows that LLMs are apt to reproduce all kinds of human biases, such as preferring the first choice out of many (primacy bias) or the last one (recency bias). Amusingly, the LLM might replicate these biases slightly wrong and, in doing so, produce new derived biases.
But human expectations are not bias-free either (e.g. the same preference for the first of several choices).
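As a concrete illustration, here is a minimal sketch of how one could probe a model for this kind of position bias: show the same multiple-choice question with its options in every order and count how often the model's pick tracks the slot rather than the content. `ask_model` is a hypothetical placeholder for whatever completion call is actually used.

```python
import itertools
from collections import Counter

def ask_model(prompt: str) -> str:
    # Placeholder: replace with a real chat-completion call.
    # Returning "A" here simulates a model with an extreme primacy bias.
    return "A"

def probe_position_bias(question: str, options: list[str]) -> Counter:
    """Count which answer *slot* the model picks across all option orderings."""
    picks_by_position = Counter()
    for perm in itertools.permutations(options):
        labels = "ABCDEFGH"[: len(perm)]
        prompt = question + "\n" + "\n".join(
            f"{lab}) {opt}" for lab, opt in zip(labels, perm)
        )
        answer = ask_model(prompt)  # e.g. "B"
        if answer in labels:
            # Record the chosen slot, ignoring which option happened to sit there.
            picks_by_position[labels.index(answer)] += 1
    return picks_by_position

if __name__ == "__main__":
    counts = probe_position_bias(
        "Which city is the capital of Australia?",
        ["Canberra", "Sydney", "Melbourne"],
    )
    print(counts)  # a heavy skew toward slot 0 suggests primacy bias
```

Since every option occupies every slot equally often across the permutations, an unbiased model spreads its picks across slots; a spike at the first or last slot points to primacy or recency bias respectively.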
How can the RLHF phase eliminate bias if it relies on a process (human input) that carries the same biases as the pre-training data (also human input)?
During RLHF, the human evaluators are made aware of such biases and are instructed to down-vote model responses that exhibit them.
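To make that mechanism concrete, here is a rough sketch of how such a down-vote can become a training signal, assuming the common pairwise reward-model setup used in InstructGPT-style RLHF: the evaluator's preferred response is `chosen`, the down-voted one is `rejected`, and `reward_model` is a hypothetical scorer returning a scalar per response.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry style pairwise loss for training a reward model."""
    r_chosen = reward_model(prompt, chosen)      # scalar reward for the preferred response
    r_rejected = reward_model(prompt, rejected)  # scalar reward for the down-voted response
    # Minimizing -log sigmoid(margin) pushes biased (rejected) responses
    # toward lower reward than their preferred counterparts.
    return -F.logsigmoid(r_chosen - r_rejected)
```

The policy is then optimized against this reward model, so responses the evaluators flagged as biased end up with lower reward and are sampled less often, even though the underlying data sources remain human and therefore imperfect.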