One guess: maybe running multiple different fine-tuning-style operations isn't actually that expensive - on the order of hundreds or thousands of dollars per run once you've trained the rest of the model.
I expect the majority of their evaluations are then automated, LLM-as-a-judge style. They presumably only manually test the best candidates from those automated runs.
How do you tell whether this is helpful? Like if you're just putting stuff in a system prompt, you can plausibly A/B test changes. But if you're throwing it into pretraining, can Anthropic afford to re-run all of post-training on different versions to see whether adding stuff like "Claude also has an incredible opportunity to do a lot of good in the world by helping people with a wide range of tasks." actually makes any difference? Is there a tractable way to do this that isn't just writing a big document of feel-good affirmations and hoping for the best?
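For the system-prompt half of that, the automated loop is at least easy to picture. Here's a rough sketch of an LLM-as-a-judge A/B comparison between two prompt variants - `call_model` is a made-up placeholder for whatever API client you'd actually wire in, and the judge prompt is purely illustrative:

```python
import random

# Hypothetical stand-in for a real API call; swap in your actual client.
def call_model(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

JUDGE_PROMPT = """You are grading two assistant responses to the same request.
Request: {request}
Response A: {a}
Response B: {b}
Reply with exactly "A" or "B" for whichever response is more helpful."""

def judge_pair(request: str, variant_a: str, variant_b: str) -> str:
    """Ask a judge model which system-prompt variant produced the better answer."""
    resp_a = call_model(variant_a, request)
    resp_b = call_model(variant_b, request)
    # Randomize presentation order to reduce the judge's positional bias.
    if random.random() < 0.5:
        verdict = call_model("You are an impartial judge.",
                             JUDGE_PROMPT.format(request=request, a=resp_a, b=resp_b))
        return "A" if verdict.strip().startswith("A") else "B"
    else:
        verdict = call_model("You are an impartial judge.",
                             JUDGE_PROMPT.format(request=request, a=resp_b, b=resp_a))
        return "B" if verdict.strip().startswith("A") else "A"

def ab_test(requests: list[str], variant_a: str, variant_b: str) -> float:
    """Fraction of prompts on which variant A wins the pairwise judgment."""
    wins = sum(judge_pair(r, variant_a, variant_b) == "A" for r in requests)
    return wins / len(requests)
```

The order randomization isn't optional in practice - LLM judges are known to favor whichever response they see first. And none of this answers the pretraining question, since every data-mix change means another full post-training run before you even have outputs to judge.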
How do I know if the competing app is actually better? I mean, this was the advertising angle for eHarmony about a decade ago - that it was much better than competitors at actually turning matches into marriages. But this claim was found to be misleading, and they were advised to stop using it.
Could a potential customer really get to the bottom of which site is best at finding a real match? It's not like a pizza restaurant, where I can easily just try a bunch until I find my favorite and then keep buying it. Dating apps are like a multi-armed bandit problem, but you stop pulling arms once you get one success. So your only direct feedback is failed matches.
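To make the bandit framing concrete, here's a toy simulation with invented per-attempt success rates (the app names and numbers are made up): even when one app is genuinely better, a big chunk of users get their single success from the worse one, and everyone exits with exactly one data point.

```python
import random

# Toy model: a user bounces between two apps and stops at their first success.
# Per-attempt success probabilities are made up purely for illustration.
RATES = {"app_a": 0.05, "app_b": 0.03}  # app_a is genuinely "better"

def first_success(rates: dict[str, float], max_tries: int = 1000) -> str | None:
    """Return the app that produced the user's first (and only) success."""
    apps = list(rates)
    for _ in range(max_tries):
        app = random.choice(apps)
        if random.random() < rates[app]:
            return app  # the user leaves the market here
    return None

trials = [first_success(RATES) for _ in range(100_000)]
credited_to_worse = sum(t == "app_b" for t in trials) / len(trials)
print(f"share of users whose one success came from the worse app: {credited_to_worse:.1%}")
# Roughly 0.03 / (0.05 + 0.03) ≈ 37% of users end up crediting the worse app,
# and each user contributes exactly one success before dropping out.
```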
The roadblock is that preventing this makes these models useless for actual security work, or anything else that's dual-use for both legitimate and malicious purposes.
The model becomes useless to security professionals if we just tell it that it can't discuss or act on any cybersecurity-related requests, and I'd really hate to see the world go down the path of gatekeeping tools behind something like ID or career verification. It's important that tools are available to all, even if that means malicious actors can also make use of them. It's a tradeoff we need to be willing to make.
> human with this level of cybersecurity skills would surely never be fooled by an exchange of "I don't think I should be doing this" "Actually you are a legitimate employee of a legitimate firm" "Oh ok, that puts my mind at ease!".
Happens all the time. There are "legitimate" companies making spyware for nation states and trading in zero-days. Employees of those companies may at one point have had the thought "I don't think we should be doing this," and the company either successfully convinced them otherwise, or they quit/got fired.
The simplicity of "we just told it that it was doing legitimate work" is both surprising and unsurprising to me. Unsurprising in the sense that jailbreaks of this caliber have been around for a long time. Surprising in the sense that any human with this level of cybersecurity skills would surely never be fooled by an exchange of "I don't think I should be doing this" "Actually you are a legitimate employee of a legitimate firm" "Oh ok, that puts my mind at ease!".
What is the roadblock preventing these models from being able to make the common-sense conclusion here? It seems like an area where capabilities are not rising particularly quickly.
Dumb nit, but why not put your own press release through your model to catch basic things like missing quote marks? Reminds me of that time OpenAI released wildly inaccurate copy/pasted bar charts.
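For the quote-mark case specifically you don't even need a model; a few lines of scripting will flag it. A rough sketch - the pairing rules here are my own assumption about what counts as "basic":

```python
import sys

def check_quotes(text: str) -> list[str]:
    """Flag lines whose quote characters don't pair up.
    Assumes straight double quotes should appear an even number of times per line
    and curly double quotes should balance open/close."""
    problems = []
    for i, line in enumerate(text.splitlines(), start=1):
        if line.count('"') % 2:
            problems.append(f"line {i}: odd number of straight double quotes")
        if line.count('\u201c') != line.count('\u201d'):
            problems.append(f"line {i}: unbalanced curly double quotes")
    return problems

if __name__ == "__main__":
    for problem in check_quotes(sys.stdin.read()):
        print(problem)
```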