Should ChatGPT have the ability to alert a hotline or emergency services when it detects a user is about to commit suicide? Or would it open a can of worms?
I don't think we should have to choose between "sycophantic coddling" and "alert the authorities". Surely there's a middle ground where it should be able to point the user to help and then refuse to participate further.
Of course jailbreaking via things like roleplay might still be possible, but at that point I don't really blame the model if the user is engineering the outcome.
"it encodes the training data in weights to predict a token mimicking a human ..." - better?