melded commented on Heretic: Automatic censorship removal for language models   github.com/p-e-w/heretic... · Posted by u/melded
jameslk · a month ago
Could models mitigate this by answering questions incorrectly with random information instead of outright refusing to answer them?
melded · a month ago
from what i understand, they don't really have the self-awareness/agency to do this kind of thing on purpose as a response to abliteration (although if they end up conversing on topics for which there was no data in their training set, they will produce incorrect and random information, just not for lack of "trying").
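
for context, abliteration works roughly by estimating a "refusal direction" in the model's activation space and projecting it out of the hidden states. here's a minimal pytorch sketch of that projection (the function and variable names are illustrative, not Heretic's actual code):

```python
import torch

def ablate_refusal(hidden: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    # hidden: (..., d_model) activations from a transformer layer
    # refusal_dir: (d_model,) direction estimated by contrasting activations
    # on prompts the model refuses vs. prompts it answers
    r = refusal_dir / refusal_dir.norm()       # normalize to a unit vector
    coeff = hidden @ r                         # component of each state along r
    return hidden - coeff.unsqueeze(-1) * r    # subtract that component out
```

with that direction removed the model can no longer express its refusal behavior, but it also has no mechanism to detect the edit and deliberately answer wrong instead.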

but with some (unmodified) models i've tried (i don't remember the names, unfortunately), it definitely seemed like they weren't trained to outright refuse things but to answer poorly instead. so my impression is that this is indeed a strategy some model producers use.

(if anyone can debunk this i'd be interested in hearing it; i'm only superficially familiar with the methods in use, and this is basically a guess at what would explain why those models behaved the way they did.)

SilverElfin · a month ago
How do you remove censorship that appears due to the biased selection of training data?
melded · a month ago
in that case you'd need to do actual training/finetuning with a dataset that covers the information that was left out of the original training data.
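
a minimal sketch of what that could look like with the transformers library (the model name, data, and hyperparameters here are placeholders; a real run would want batching, label masking, and so on):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-base-model"  # hypothetical; the model whose gaps you want to fill
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# text covering a topic the original training corpus left out
text = "...documents about the omitted subject..."
batch = tokenizer(text, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for step in range(100):  # placeholder step count
    # standard causal-LM objective; transformers shifts the labels internally
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

in practice people usually reach for parameter-efficient methods like LoRA rather than full-weight finetuning, but the principle is the same: missing knowledge has to go in through training, it can't be "un-censored" out.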
