In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.
I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.
This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, any model on Ollama, and more out of the box. As an example, I wrote up a benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
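For anyone curious, here's a rough sketch of what a programmatic run can look like from Node. This is simplified; the provider IDs and the exact shape of the evaluate() options may differ from the current API, so check the repo docs before relying on it:

```typescript
// Simplified sketch; option names may differ from the current promptfoo API.
import promptfoo from 'promptfoo';

async function main() {
  const results = await promptfoo.evaluate({
    // Prompts are templates; {{question}} is filled in from each test's vars.
    prompts: ['Answer concisely: {{question}}'],
    // Run the same prompt across several providers side by side.
    providers: ['openai:gpt-3.5-turbo', 'ollama:llama2'],
    tests: [
      {
        vars: { question: 'What is the capital of France?' },
        // Simple string assertion; richer checks are also supported.
        assert: [{ type: 'contains', value: 'Paris' }],
      },
    ],
  });

  console.log(JSON.stringify(results, null, 2));
}

main();
```

The same test suite can also be declared in a config file and run from the CLI, which is the more common workflow.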
By comparison, when I saw a few of Donatello’s sculptures I was surprised by how poorly he was able to convey faces compared to other artists. I am no art historian, I am just a big fan of TMNT :)
Compared to his contemporaries? He is possibly the most influential sculptor of the Renaissance. Michelangelo wasn't even born yet when Donatello died.
What exactly did I write that is wrong? UG argues that a universal grammar exists, and that humans innately possess knowledge of this grammar. It's this grammar that enables humans to learn a language (according to UG). UG linguists have created a system of syntax rules that attempts to describe any language, but it fails once they step outside of Indo-European languages. This is partly because Chomsky was hired by MIT to solve machine translation, and putting language into a set of neat boxes was his best idea for that, and partly because Chomsky himself had little knowledge of other languages. It's pseudoscience.
It’s notable how successful LLMs are despite the lack of any linguistic tools in their architectures. It would be interesting to know how different a model would be if it operated on, e.g., dependency trees instead of a linear list of tokens. Surely the question of “a/an” would be solved with ease, as the model would be required to come up with a noun token before choosing its determiner. I wonder if the developers of LLMs explored those approaches but found them infeasible due to large preprocessing times, the immaturity of such tools, and/or little benefit. A toy sketch of the idea follows.
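To make the a/an point concrete, here is an entirely hypothetical toy (not how any real LLM works): if generation proceeds head-first over a dependency tree, the noun is fixed before its determiner, so the a/an choice collapses into a lookup rather than a prediction made before the noun is known.

```typescript
// Toy illustration only: head-first generation over a dependency tree.
// The noun head is produced before its dependents, so "a" vs "an" is a
// deterministic choice once the head is known.
interface DepNode {
  word: string;
  deps: DepNode[]; // dependents, generated after the head
}

// Naive vowel heuristic; real English has exceptions ("an hour", "a unicorn").
function indefiniteArticle(noun: string): 'a' | 'an' {
  return /^[aeiou]/i.test(noun) ? 'an' : 'a';
}

// Build a noun phrase head-first: pick the noun, then attach its determiner.
function nounPhrase(noun: string): DepNode {
  return { word: noun, deps: [{ word: indefiniteArticle(noun), deps: [] }] };
}

// Linearize dependents before the head to recover English surface order.
function linearize(node: DepNode): string {
  return [...node.deps.map(linearize), node.word].join(' ');
}

console.log(linearize(nounPhrase('apple')));  // "an apple"
console.log(linearize(nounPhrase('banana'))); // "a banana"
```

A token-by-token model has to commit to the determiner first and hope the noun it later generates agrees with it; the head-first ordering sidesteps that entirely, which is the appeal of the idea.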
Grammar as we know it was devised for Latin, and linguists spend most of their time attempting to fit other languages into neat boxes that Latin grammar wasn't designed for. This, of course, leads to absurdity. Chomsky attempted to solve this problem with his universal grammar, but that too stops working quickly once you get outside of European languages. That is, ignoring linguistic tools is one of the reasons GPT is successful.