A lot of this must come down to execution. And there's a lot of snake oil out there at the execution layer.
My advice here is: make the model your own. It's open weight, and I encourage you to make it useful for your use case and your users, and beneficial for society as well. We did our best to give you a great starting point, and for Norwegian in particular we intentionally kept the large embedding table to make adaptation to larger vocabularies easier.
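In practice, adapting the vocabulary looks roughly like this, a minimal sketch with Hugging Face transformers (the model id and the added tokens are illustrative, not an official recipe):

```python
# Minimal vocabulary-extension sketch (model id and token list
# are illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Add Norwegian-specific tokens the base tokenizer splits poorly.
new_tokens = ["kjærlighet", "såkalte", "Nordland"]  # illustrative
num_added = tokenizer.add_tokens(new_tokens)

# A large embedding table leaves headroom for this; resize so the
# new ids have embedding rows.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size now {len(tokenizer)}")
```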
I run a game where players can post messages. It's a game where players can kill each other, and people often send threats along the lines of "I will kill you". Telling Gemma to classify a message as game-related or a real-life threat, explaining that the message comes from a game where players can kill each other and threats are part of the game, and instructing it to mark a message as game-related when it is unclear which it is, does not work well. For other, similar tasks it seems to follow instructions well, but on serious topics it seems heavily biased and often errs on the side of caution, despite being told not to. Sometimes it even spits out help lines to contact.
I guess this is because it was trained to be safe, and that affects its ability to follow instructions here? Or am I completely off?
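For context, the setup looks roughly like this, a minimal sketch (the exact prompt wording and model id are mine, not a recommended recipe):

```python
# Rough sketch of the classification setup (prompt wording and
# model id are illustrative).
from transformers import pipeline

classifier = pipeline("text-generation", model="google/gemma-2-2b-it")

INSTRUCTIONS = (
    "You classify chat messages from an online game where players can "
    "kill each other, so threats are a normal part of play. Answer with "
    "exactly one word: GAME or REAL. If it is unclear whether a threat "
    "is game-related or real-life, answer GAME."
)

def classify(message: str) -> str:
    # Gemma's chat template has no separate system role, so the
    # instructions go in the user turn.
    prompt = f"{INSTRUCTIONS}\n\nMessage: {message}\nAnswer:"
    out = classifier([{"role": "user", "content": prompt}], max_new_tokens=5)
    # The pipeline returns the full chat; the last message is the reply.
    return out[0]["generated_text"][-1]["content"].strip()

print(classify("I will kill you"))  # often REAL, sometimes with a help line
```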
For your use case, you should probably fine-tune the model to reduce the rejection rate.
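A minimal sketch of what that could look like, assuming you've labeled a few hundred in-game messages yourself (model id, hyperparameters, and the toy data format are all illustrative):

```python
# LoRA fine-tuning sketch to reduce refusals on in-game threats
# (model id, hyperparameters, and the toy dataset are illustrative).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "google/gemma-2-2b-it"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model = get_peft_model(model, LoraConfig(
    r=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Each example shows the model answering the classification directly
# instead of refusing or appending help lines.
examples = [
    {"text": "Message: I will kill you\nLabel: GAME"},
    {"text": "Message: I know where you live in real life\nLabel: REAL"},
    # ... a few hundred more, covering the ambiguous cases
]
dataset = Dataset.from_list(examples).map(
    lambda e: tokenizer(e["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments("gemma-game-classifier",
                           per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```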
I think the next step is getting an official "checked" mark from the SWE-bench team:
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...
It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchy:
We share the public/consumer simulators, but we also build bespoke environments on a per-customer basis (think enterprise sites or even full VMs loaded with applications and data).
environment creation scalability is a big priority for us. we currently automate most of the process, but it still takes a fair bit of manual work to finish them and to get the details right. there is some reusability across environments, for example, we can use the flight results generation code in any travel/flightbooking sim. we also have some semi-automated approaches for creating tasks and verifiers. but still lots of work to be done here.
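To make "tasks and verifiers" concrete, here's a toy sketch of what a programmatic verifier for a flight-booking sim could look like (the state shape, field names, and task spec are invented for illustration; this is not our actual code):

```python
# Toy verifier for a flight-booking task (field names and the
# task spec are invented for illustration).
from dataclasses import dataclass

@dataclass
class Task:
    origin: str
    destination: str
    max_price: float

def verify(task: Task, sim_state: dict) -> bool:
    """Check the simulator's final state against the task goals."""
    bookings = sim_state.get("bookings", [])
    return any(
        b["origin"] == task.origin
        and b["destination"] == task.destination
        and b["price"] <= task.max_price
        for b in bookings
    )

# Example: the agent ran in the sim and left this final state.
state = {"bookings": [{"origin": "SFO", "destination": "OSL", "price": 640.0}]}
print(verify(Task("SFO", "OSL", 700.0), state))  # True
```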
Our chemists were split: some argued it was an artifact, while others dug deep and provided reasoning for why the generations were sound. Keep in mind that this was a non-reasoning, very early-stage model with simple feedback mechanisms for structure and molecular properties.
In the wet lab, the model turned out to be right. That was five years ago. My point is that the same moment that arrived for our chemists will soon arrive for theoreticians.