tosh · 2 years ago
Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B

  +-------------+----------+-------------+-------------+
  | Benchmark   | Gemma 7B | Mistral 7B  | Llama-2 7B  |
  +-------------+----------+-------------+-------------+
  | MMLU        |   64.3   |     60.1    |     45.3    |
  | HellaSwag   |   81.2   |     81.3    |     77.2    |
  | HumanEval   |   32.3   |     30.5    |     12.8    |
  +-------------+----------+-------------+-------------+
via https://mistral.ai/news/announcing-mistral-7b/

sa-code · 2 years ago
Thank you. I thought it was weird for them to release a 7B model and not mention Mistral in their release.
mochomocha · 2 years ago
The technical report (linked in the 2nd paragraph of the blog post) mentions it, and compares against it: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...
nl · 2 years ago
The release page has comparisons to Mistral everywhere: https://ai.google.dev/gemma

mirekrusin · 2 years ago
They forgot.

Also phi-2.

brucethemoose2 · 2 years ago
Only 8K context as well, like Mistral.

Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.

DreamGen · 2 years ago
Mistral Instruct v0.2 is 32K.
tosh · 2 years ago
Agree: it will be interesting to see how Gemma does on Chatbot Arena
Kydlaw · 2 years ago
They state in their report that they filter evaluation data out of their training data, see p.3 - Filtering:

"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive outputs."

YetAnotherNick · 2 years ago
According to their paper, the average over the standard tasks is 54.0 for Mistral and 56.4 for Gemma, so about 4.4% better in relative terms. Not as big a gap as you would expect from the company that invented transformers and probably has 2-3 orders of magnitude more compute for training, versus a few-month-old French startup.

Also of note from their human evaluations: Gemma 7B IT has a 51.7% win rate against Mistral 7B v0.2 Instruct.

jcuenod · 2 years ago
Came here to post the same thing for Phi-2:

  +-------------+----------+-------------+
  | Benchmark   | Gemma 2B | Phi-2 2.7B  |
  +-------------+----------+-------------+
  | MMLU        |   42.3   |     56.7    |
  | MBPP        |   29.2   |     59.1    |
  | BoolQ       |   69.4   |     83.3    |
  +-------------+----------+-------------+

[0] https://www.kaggle.com/models/google/gemma

[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...

rfw300 · 2 years ago
A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.
daemonologist · 2 years ago
Really looking forward to the day someone puts out an open model which outperforms Flan-T5 on BoolQ.
FergusArgyll · 2 years ago
The real gold will be when this gets finetuned (maybe by Mistral...)
brucethemoose2 · 2 years ago
TBH the community has largely outrun Mistral's own finetuning. The 7B model in particular is such a popular target because it's so practical to train.
itomatik · 2 years ago
how does one finetune llama (or any other LLM) using mistral?

is the flow like this?

- take a small dataset

- generate a bigger dataset using mistral (how is this done?)

- run LoRA to fine-tune gemma on the extended dataset (rough sketch below)
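For the last step, a minimal sketch of what LoRA fine-tuning could look like with Hugging Face transformers + peft (the model id, file name, and hyperparameters here are placeholders, not a recommended recipe):

```python
# LoRA fine-tuning sketch (assumes transformers, peft, datasets, accelerate are installed).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-7b"  # placeholder; any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Train small adapter matrices instead of updating all 7B weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Hypothetical JSONL file with a "text" field holding the Mistral-generated prompt+response pairs.
dataset = load_dataset("json", data_files="synthetic_pairs.jsonl", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```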

heckl239u · 2 years ago
Some test vids came out on the 7B model: https://www.youtube.com/watch?v=1Mn0U6HGLeg Shocker, it doesn't perform well at all.
attentive · 2 years ago
In my subjective tests it's not even close to Mistral. My local Gemma is quantized, but so is Mistral.

But I also tried gemma on huggingface.co/chat which I assume isn't quantized.

lawxls · 2 years ago
Honestly, this is more of a PR stunt to advertise the Google Dev ecosystem than a contribution to open-source. I'm not complaining, just calling it what it is.

Barely an improvement over the 5-month-old Mistral model, with the same context length of 8k. And this is a release after their announcement of Gemini Pro 1.5, which had an exponential increase in context length.

scarmig · 2 years ago
Who cares if it's a PR stunt to improve developer good will? It's still a good thing, and it's now the most open model out there.
crossroadsguy · 2 years ago
That’s about the point of having a developer ecosystem, isn’t it?
kiraaa · 2 years ago
mistral 7b v0.2 supports 32k
simonw · 2 years ago
The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy

Something that caught my eye in the terms:

> Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.

One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.

a2128 · 2 years ago
This is actually not that unusual. Stable Diffusion's license, CreativeML Open RAIL-M, has the exact same clause: "You shall undertake reasonable efforts to use the latest version of the Model."

Obviously updating the model is not very practical when you're using finetuned versions, and people still use old versions of Stable Diffusion. But it does make me fear the possibility that if they ever want to "revoke" everybody's license to use the model, all they have to do is just post a model update that's functionally useless for anything and go after anyone still using the old versions that actually do anything.

slowmovintarget · 2 years ago
So if they wish to apply censorship they forgot about, or suddenly discovered a reason for, they want you to be obligated to take it.

Good faith possibilities: Copyright liability requires retraining, or altering the underlying training set.

Gray area: "Safety" concerns where the model recommends criminal behavior (see uncensored GPT 4 evaluations).

Bad faith: Censorship or extra weighting added based on political agenda or for-pay skewing of results.

iandanforth · 2 years ago
These are all very new licenses that deviate from OSI principles, I think it's fair to call them "unusual".
simonw · 2 years ago
That's useful context, thanks - I hadn't realized this clause was already out there for other models.
wongarsu · 2 years ago
I don't think a broken model would trigger that clause in a meaningful way, because then you simply can't update with reasonable effort. You would be obliged to try the new model in a test environment, and as soon as you notice it doesn't perform and making it perform would require unreasonable effort you can simply stay on the old version.

However you might be required to update if they do more subtle changes, like a new version that only speaks positively about Google and only negatively about Microsoft. Provided this doesn't have an obvious adverse impact on your use of the model.

ummonk · 2 years ago
Switching to a model that is functionally useless doesn't seem to fall under "reasonable efforts" to me, but IANAL.
Silphendio · 2 years ago
It's worth noting that Stable Diffusion XL uses the OpenRAIL++-M License, which removed the update obligation.
jacooper · 2 years ago
Why the hell do they use such a crappy license in the first place?
tgtweak · 2 years ago
I don't think there's a way they can enforce that reasonably. There's no connection to the mothership to report back what version is being used or license keys at runtime...

Seems more like a "if we discover something unsafe you should update your model and we aren't liable if you don't" than something that would make your model stop working.

summerlight · 2 years ago
This kind of defensive statement in a ToS is usually due to obscure regulations or leading cases, and model developers need a way to limit liability. There's no practical way to enforce this, but they can claim that when bad things happen it's purely on model users rather than model developers.
pram · 2 years ago
They have to make sure you’re receiving the most cutting edge chiding lectures when you make naughty and problematic requests.
astrange · 2 years ago
You can't make a local model do that: e.g. you can force the answer to begin with "Yes" or use control vectors so it agrees with you.
xyzzyz · 2 years ago
This is strangely reminiscent of the Soviet Union, where after they got rid of Lavrentiy Beria, they mailed the update to subscribers of the Great Soviet Encyclopedia, where they asked to remove the three pages with Beria’s biography and replace them with the three provided pages.
legohead · 2 years ago
Sounds like it's "reasonable" for you not to update then.
wahnfrieden · 2 years ago
It says you must make efforts (to a reasonable extent), not that you must give a reason for not making efforts
maronato · 2 years ago
This sounds like a clause to cover themselves in case older versions have any serious issues
catchnear4321 · 2 years ago
reasonable effort - meaning if their changes meaningfully impact my usage, negatively, it would be unreasonable to ask me to upgrade.

sounds good.

this is not financial advice and ianal.

res0nat0r · 2 years ago
Isn't this just lawyer speak for "we update our model a lot, and we've never signed off on saying we're going to support every previous release we've ever published, and may turn them off at any time, don't complain about it when we do."
4bpp · 2 years ago
Ugh, I would fully expect this kind of clause to start popping up in other software ToSes soon if it hasn't already. Contractually mandatory automatic updates.
bsimpson · 2 years ago
I appreciated this post clarifying the distinction between "open model" and "open source":

https://opensource.googleblog.com/2024/02/building-open-mode...

I'm not sure how to feel about the restrictions. "No porn" feels prudish, particularly for this millennium. I tend to err on the side of freedom in intellectual/political matters; however, the others seem fairly reasonable as far as restrictions go.

phillipcarter · 2 years ago
Huh. I wonder why that is a part of the terms. I feel like that's more of a support concern.

silverliver · 2 years ago
You don't have to agree to this policy to use the model.
samstave · 2 years ago
model watermarking? does this exist?

alekandreev · 2 years ago
Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models.

Opinions are our own and not of Google DeepMind.

voxgen · 2 years ago
Thank you very much for releasing these models! It's great to see Google enter the battle with a strong hand.

I'm wondering if you're able to provide any insight into the below hyperparameter decisions in Gemma's architecture, as they differ significantly from what we've seen with other recent models?

* On the 7B model, the `d_model` (3072) is smaller than `num_heads * d_head` (16*256=4096). I don't know of any other model where these numbers don't match.

* The FFN expansion factor of 16x is MUCH higher than the Llama-2-7B's 5.4x, which itself was chosen to be equi-FLOPS with PaLM's 4x.

* The vocab is much larger - 256k, where most small models use 32k-64k.

* GQA is only used on the 2B model, where we've seen other models prefer to save it for larger models.

These observations are in no way meant to be criticism - I understand that Llama's hyperparameters are also somewhat arbitrarily inherited from its predecessors like PaLM and GPT-2, and that it's non-trivial to run hyperopt on such large models. I'm just really curious about what findings motivated these choices.
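(For what it's worth, the first point works mechanically because the attention output projection maps num_heads * d_head back down to d_model. A toy shape check, not Gemma's actual code; the dimensions are the 7B ones mentioned above:)

```python
# Toy shape check: attention where num_heads * d_head != d_model (PyTorch).
import torch
import torch.nn as nn

d_model, num_heads, d_head, seq = 3072, 16, 256, 8  # Gemma-7B-like dimensions
inner = num_heads * d_head                          # 4096, wider than d_model

q_proj = nn.Linear(d_model, inner, bias=False)
k_proj = nn.Linear(d_model, inner, bias=False)
v_proj = nn.Linear(d_model, inner, bias=False)
o_proj = nn.Linear(inner, d_model, bias=False)      # maps 4096 back down to 3072

x = torch.randn(1, seq, d_model)
q = q_proj(x).view(1, seq, num_heads, d_head).transpose(1, 2)
k = k_proj(x).view(1, seq, num_heads, d_head).transpose(1, 2)
v = v_proj(x).view(1, seq, num_heads, d_head).transpose(1, 2)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1) @ v
out = o_proj(attn.transpose(1, 2).reshape(1, seq, inner))
print(out.shape)  # torch.Size([1, 8, 3072]): the residual stream width is unchanged
```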

owl_brawl · 2 years ago
I would love answers to these questions too, particularly on the vocab size
lordswork · 2 years ago
Is there any truth behind this claim that folks who worked on Gemma have left Google?

https://x.com/yar_vol/status/1760314018575634842

lordswork · 2 years ago
I confirmed all the folks listed on page 12 are still at Google (listed below). I am guessing the linked tweet is a BS claim.

   # Product Management
   Tris Warkentin
   Ludovic Peran

   # Program Management
   Minh Giang

   # Executive Sponsors
   Clement Farabet
   Oriol Vinyals
   Jeff Dean
   Koray Kavukcuoglu
   Demis Hassabis
   Zoubin Ghahramani
   Douglas Eck
   Joelle Barral
   Fernando Pereira
   Eli Collins

   # Leads
   Armand Joulin
   Noah Fiedel
   Evan Senter

   # Tech Leads
   Alek Andreev†
   Kathleen Kenealy†

elcomet · 2 years ago
It seems very easy to check, no? Look at the names in the paper and check where they are working now.

CaffeinatedDev · 2 years ago
Them: here to answer questions

Question

Them: :O

bluefinity · 2 years ago
To be fair, the tweet says that they don't work on the models at Google anymore, not that they have left Google.

Might be true, might not be. It's unsourced speculation.

LorenDB · 2 years ago
EDIT: it seems this is likely an Ollama bug, please keep that in mind for the rest of this comment :)

I ran Gemma in Ollama and noticed two things. First, it is slow: Gemma got less than 40 tok/s while Llama 2 7B got over 80 tok/s. Second, it is very bad at output generation. I said "hi", and it responded with this:

```
Hi, . What is up? melizing with you today!

What would you like to talk about or hear from me on this fine day??
```

With longer and more complex prompts it goes completely off the rails. Here's a snippet from its response to "Explain how to use Qt to get the current IP from https://icanhazip.com":

```python
print( "Error consonming IP arrangration at [local machine's hostname]. Please try fufing this function later!") ## guanomment messages are typically displayed using QtWidgets.MessageBox
```

Do you see similar results on your end or is this just a bug in Ollama? I have a terrible suspicion that this might be a completely flawed model, but I'm holding out hope that Ollama just has a bug somewhere.

mark_l_watson · 2 years ago
I was going to try these models with Ollama. Did you use a small number of bits/quantization?
fosterfriends · 2 years ago
Not a question, but thank you for your hard work! Also, brave of you to join the HN comments, I appreciate your openness. Hope y'all get to celebrate the launch :)
lnyan · 2 years ago
Will there be Gemma-vision models or multimodal Gemma models?
alekandreev · 2 years ago
We have many exciting things planned that we can't reveal just yet :)
Jayakumark · 2 years ago
Have the same question.
h1t35h · 2 years ago
It seems you have exposed the internal debugging tool link in the blog post. You may want to do something about it.
trisfromgoogle · 2 years ago
Ah, I see -- the link is wrong, thank you for flagging! Fixing now.
pama · 2 years ago
Will these soon be available on lmsys for human comparison against other models? Can they run with llama.cpp?
sbarre · 2 years ago
Can the Gemma models be downloaded to run locally, like open-source models Llama2, Mistral, etc ?

Or is your definition of "open" different?

austinvhuang · 2 years ago
Yes, models can be downloaded locally. In addition to the Python NN frameworks and ggml as options, we also implemented a standalone C++ implementation that you can run locally at https://github.com/google/gemma.cpp
kathleenfromgdm · 2 years ago
Yes, you can get started downloading the model and running inference on Kaggle: https://www.kaggle.com/models/google/gemma ; for a full list of ways to interact with the model, you can check out https://ai.google.dev/gemma.
Kostic · 2 years ago
It should be possible to run it via llama.cpp[0] now.

[0] https://github.com/ggerganov/llama.cpp/pull/5631

mrob · 2 years ago
Mistral weights are released under an Apache 2.0 license, but Llama 2 weights are released under a proprietary license that prohibits use by large organizations and imposes usage restrictions, violating terms 5 and 6 of the Open Source Definition[0]. Even if you accept that a model with a proprietary training dataset and proprietary training code can be considered "open source", there's no way Llama 2 qualifies.

For consistency with existing definitions[1], Llama 2 should be labeled a "weights available" model.

[0] https://en.wikipedia.org/wiki/The_Open_Source_Definition

[1] https://en.wikipedia.org/wiki/Source-available_software

tomp · 2 years ago
Their definition of "open" is "not open", i.e. you're only allowed to use Gemma in a "non-harmful" way.

We all know that Google thinks that saying that 1800s English kings were white is "harmful".

neximo64 · 2 years ago
How are these performing so well compared to Llama 2? Are there any documents on the architecture and differences? Is it MoE?

Also note some of the links on the blog post don't work, e.g debugging tool.

kathleenfromgdm · 2 years ago
We've documented the architecture (including key differences) in our technical report here (https://goo.gle/GemmaReport), and you can see the architecture implementation in our Git Repo (https://github.com/google-deepmind/gemma).
declaredapple · 2 years ago
Congrats on the launch and thanks for the contribution! This looks like it's on par with or better than Mistral 7B 0.1, or is that 0.2?

Are there plans for MoE or 70B models?

kathleenfromgdm · 2 years ago
Great question - we compare to the Mistral 7B 0.1 pretrained models (since there were no pretrained checkpoint updates in 0.2) and the Mistral 7B 0.2 instruction-tuned models in the technical report here: https://goo.gle/GemmaReport
audessuscest · 2 years ago
Does this model also think Germans were black 200 years ago? Or is it afraid to answer basic stuff? Because if that's the case, no one will care about this model.
graphe · 2 years ago
I disagree, coding and RAG performance is all that matters to me. I'm not using an LLM to learn basic facts I already know.
freedomben · 2 years ago
I don't know anything about these twitter accounts so I don't know how credible they are, but here are some examples for your downvoters, who I'm guessing think you're just trolling or grossly exaggerating:

https://twitter.com/aginnt/status/1760159436323123632

https://twitter.com/Black_Pilled/status/1760198299443966382

zitterbewegung · 2 years ago
Do you have a plan of releasing higher parameter models?
alekandreev · 2 years ago
We have many great things in research and development phases, so stay tuned. I'm hopeful we can share more in the coming weeks and months!
memossy · 2 years ago
Training on 4096 TPUv5es, how did you handle the crazy batch size? :o
tosh · 2 years ago
Are there any plans for releasing the datasets used?
alekandreev · 2 years ago
This would be really interesting in my opinion, but we are not releasing datasets at this time. See the C4 dataset for an earlier open dataset from Google.
CuriouslyC · 2 years ago
It's cool that you guys are able to release open stuff, that must be a nice change from the modus operandi at Goog. I'll have to double check, but it looks like Phi-2 beats your performance in some cases while being smaller. I'm guessing the value proposition of these models is being small and good while also having more knowledge baked in?
alekandreev · 2 years ago
We deeply respect the Phi team and all other teams in the open model space. You’ll find that different models have different strengths and not all can be quantified with existing public evals. Take them for a spin and see what works for you.
owl_brawl · 2 years ago
Hi alekandreev,

Any reason you decided to go with a token vocabulary size of 256k? Smaller vocab sizes, like the ~16-32k most models of this size seem to use, are much easier to work with. Would love to understand the technical reasoning here; it isn't detailed in the report, unfortunately :(.

moffkalast · 2 years ago
I'm not sure if this was mentioned in the paper somewhere, but how much does the super large 256k tokenizer vocabulary influence inference speed, and how much higher is the average text compression compared to Llama's usual 32k? In short, is it really worth going beyond GPT-4's 100k?
nuclearjam · 2 years ago
May I ask what the RAM requirement is for running the 2B model on CPU on an average consumer Windows laptop? I have 16 GB of RAM but I am seeing a CPU/memory traceback. I'm using the transformers implementation.
dmnsl · 2 years ago
Hi, what is the cutoff date?
alekandreev · 2 years ago
September 2023.
legohead · 2 years ago
All it will tell me is mid-2018.
jmorgan · 2 years ago
Hi! This is such an exciting release. Congratulations!

I work on Ollama and used the provided GGUF files to quantize the model. As mentioned by a few people here, the 4-bit integer quantized models (which Ollama defaults to) seem to have strange output with non-existent words and funny use of whitespace.

Do you have a link/reference as to how the models were converted to GGUF format? And is it expected that quantizing the models might cause this issue?

Thanks so much!

espadrine · 2 years ago
As a data point, using the Huggingface Transformers 4-bit quantization yields reasonable results: https://twitter.com/espadrine/status/1760355758309298421
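Something along these lines (a sketch, assuming bitsandbytes is installed; the prompt and generation settings are just illustrative):

```python
# 4-bit NF4 quantized load via transformers + bitsandbytes (sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it",
                                             quantization_config=bnb,
                                             device_map="auto")

inputs = tok("Explain what 4-bit quantization does to a language model.",
             return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
```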
kleiba · 2 years ago
> We are really excited to answer any questions you may have about our models.

I cannot count how many times I've seen similar posts on HN, followed by tens of questions from other users, three of which actually get answered by the OP. This one seems to be no exception so far.

alekandreev · 2 years ago
Sorry, doing our best here :)
spankalee · 2 years ago
What are you talking about? The team is in this thread answering questions.
vorticalbox · 2 years ago
are there plans to release an official GGUF version to use with llama.cpp?
espadrine · 2 years ago
It is already part of the release on Huggingface: https://huggingface.co/google/gemma-7b/blob/main/gemma-7b.gg...

It is a pretty clean release! I had some 500 issues with Kaggle validating my license approval, so you might too, but after a few attempts I could access the model.

quickgist · 2 years ago
Will this be available as a Vertex AI foundational model like Gemini 1.0, without deploying a custom endpoint? Any info on pricing? (Also, when will Gemini 1.5 be available on Vertex?)
turnsout · 2 years ago
What is the license? I couldn’t find it on the 1P site or Kaggle.
trisfromgoogle · 2 years ago
You can find the terms on our website, ai.google.dev/gemma:

https://ai.google.dev/gemma/terms

sqreept · 2 years ago
What are the supported languages of these models?
alekandreev · 2 years ago
This v1 model is focused on English support, but you may find some multilingual capabilities.
cypress66 · 2 years ago
Can you share the training loss curve?
brucethemoose2 · 2 years ago
Will there be "extended context" releases like 01.ai did for Yi?

Also, is the model GQA?

hustwindmaple1 · 2 years ago
It's MQA, documented in the tech report

artninja1988 · 2 years ago
I find the snide remarks around open source in the paper and announcement rather off-putting.

As the ecosystem evolves, we urge the corporate AI community to move beyond demanding to be taken seriously as a player in open source for models that are not actually open, and avoid preaching with a PR statement that can be interpreted as uninformed at best or malicious at worst.

trisfromgoogle · 2 years ago
It would be great to understand what you mean by this -- we have a deep love for open source and the open developer ecosystem. Our open source team also released a blog today describing the rationale and approach for open models and continuing AI releases in the open ecosystem:

https://opensource.googleblog.com/2024/02/building-open-mode...

Thoughts and feedback welcome, as always.

jppittma · 2 years ago
Working at google is like this, where no matter how much you try to do the right thing you're always under attack.
silentsanctuary · 2 years ago
Which remarks are you referring to?
espadrine · 2 years ago
I notice a few divergences to common models:

- The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x;

- The vocabulary size is 10x (256K vs. Mistral’s 32K);

- The training token count is tripled (6T vs. Llama2's 2T)

Apart from that, it uses the classic transformer variations: MQA, RoPE, RMSNorm.

How big was the batch size that it could be trained so fast?

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/bl...
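For anyone who wants to double-check these numbers, the published configs expose most of them directly (a sketch; both repos are gated behind license acceptance on Hugging Face):

```python
# Print the relevant hyperparameters straight from the published configs (sketch).
from transformers import AutoConfig

for name in ["google/gemma-7b", "mistralai/Mistral-7B-Instruct-v0.2"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.hidden_size, cfg.intermediate_size, cfg.vocab_size,
          cfg.num_attention_heads, cfg.num_key_value_heads)
```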

andy_xor_andrew · 2 years ago
> The training token count is tripled (6T vs. Llama2's 2T)

Damn, 6T? That's a lot!

Given that this model seems to roughly match Mistral (according to the numbers from Google), this makes me think we have saturated the 7B parameter space, and couldn't possibly make it much better unless new techniques are discovered.

espadrine · 2 years ago
Hard to say definitively. Mistral’s token embeddings only account for <2% of the 7B parameters, while Gemma’s larger token vocabulary vampirized over 10%, leaving less space for the more important parts of the network. It is a somewhat surprising tradeoff given that it was pretrained towards an English bias.
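Back-of-the-envelope, using the nominal 7B figure rather than the exact parameter counts:

```python
# Rough share of parameters spent on the token embedding matrix (vocab * d_model / total).
def embed_share(vocab, d_model, total=7e9):
    return vocab * d_model / total

print(f"Mistral 7B: {embed_share(32_000, 4096):.1%}")   # ~1.9%
print(f"Gemma 7B:   {embed_share(256_000, 3072):.1%}")  # ~11.2%
```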
GaggiX · 2 years ago
Looking at the config.json of Gemma 7B, the feedforward hidden size is 8x, not 16x.
espadrine · 2 years ago
Huh, indeed, that's what the config.json[0] says; the report[1] indicates “Feedforward hidden dims: 49152”.

[0]:https://huggingface.co/google/gemma-7b-it/blob/main/config.j...

[1]: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...

lalaithion · 2 years ago
What does tokenization look like in 256k vs 32k?
espadrine · 2 years ago
It mostly means that there are tokens dedicated to rarer sequences of characters, even in foreign languages (note that Gemma is not intended to be good multilingually): “説明書” (instruction manual) has its own token, and so does “Nixon”, “آباد” (a city suffix, I believe), and the HTML sequence "\"><!--".
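A quick way to see the practical effect is to count how many tokens each tokenizer needs for the same string (a sketch, assuming both tokenizers can be fetched from Hugging Face):

```python
# Compare token counts for the same mixed-language text across tokenizers (sketch).
from transformers import AutoTokenizer

text = "説明書を読んでから、Nixonに手紙を書いた。"  # arbitrary mixed Japanese/English sample
for name in ["google/gemma-7b", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok.encode(text, add_special_tokens=False)))
```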
visarga · 2 years ago
Text encodes in fewer tokens, and language coverage is better.
margorczynski · 2 years ago
Is there a chance we'll get a model without the "alignment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine tuning.
kathleenfromgdm · 2 years ago
We release our non-aligned models (marked as pretrained or PT models across platforms) alongside our fine-tuned checkpoints; for example, here is our pretrained 7B checkpoint for download: https://www.kaggle.com/models/google/gemma/frameworks/keras/...
brucethemoose2 · 2 years ago
Alignment is all but a non-issue with open-weight base model releases, as they can be finetuned to "de-align" them if prompt engineering is not enough.

yakorevivan · 2 years ago
They have released finetuning code too. You can finetune it to remove the alignment finetuning. I believe it would take just a few hours at max and a couple of dollars.
politician · 2 years ago
More useful would be a precise characterization of the type and balance of the ideological fine tuning.

They include performance benchmarks. End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?

ben_w · 2 years ago
> End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?

Can you define that in a way that's actually testable? I can't, and I've been thinking about "unthinkable thoughts" for quite some time now: https://kitsunesoftware.wordpress.com/2018/06/26/unlearnable...

FergusArgyll · 2 years ago
You can (and someone will) fine-tune it away. There are FOSS datasets on Hugging Face you can use.

Or you can just wait, it'll be done soon...

declaredapple · 2 years ago
You can but it'll never be the same as the base model.

That said it appears they also released the base checkpoints that aren't fine-tuned for alignment

joshelgar · 2 years ago
Could you give an example of these datasets?
mustafabisic1 · 2 years ago
The fact that the Gemma team is in the comments section answering questions is praiseworthy to me :)
carom · 2 years ago
I've worked at Google. It is the organization with the highest concentration of engineering talent I've ever been at. Almost to the point that it is ridiculous, because you have extremely good engineers working on internal reporting systems for middle managers.
pphysch · 2 years ago
Why is this anonymous tweet with no evidence or engagement being posted by multiple users in this thread? Why not just make the same claim directly?
callalex · 2 years ago
The link is broken. On HN (or any forum really) it is expected for a brief description of the content to be provided when posting a link. Links die all the time, but forum posts don’t have to die with them.
robswc · 2 years ago
I personally can't take any models from google seriously.

I was asking it about the Japanese Heian period and it told me such nonsensical information you would have thought it was a joke or parody.

Some highlights were "Native American women warriors rode across the grassy plains of Japan, carrying Yumi" and "A diverse group of warriors, including a woman of European descent wielding a katana, stand together in camaraderie, showcasing the early integration of various ethnicities in Japanese society"

Stuff like that is so obviously incorrect. How am I supposed to trust it on topics where such ridiculous inaccuracies aren't so obvious to me?

I understand there will always be an amount of incorrect information... but I've never seen something this bad. Llama performed so much better.

ramoz · 2 years ago
I was wondering if these models would perform in such a way, given this week's X/twitter storm over Gemini generated images.

E.g.

https://x.com/debarghya_das/status/1759786243519615169?s=20

https://x.com/MiceynComplex/status/1759833997688107301?s=20

https://x.com/AravSrinivas/status/1759826471655452984?s=20

charcircuit · 2 years ago
Those are most likely due to the system prompt, which tries to reduce bias (but ends up introducing bias in the opposite direction for some prompts, as you can see), so I wouldn't expect to see that happen with an open model where you can control the entire system prompt.
epistasis · 2 years ago
Of all the very very very many things that Google models get wrong, not understanding nationality and skin tone distributions seems to be a very weird one to focus on.

Why are there three links to this question? And why are people so upset over it? Very odd, seems like it is mostly driven by political rage.

robswc · 2 years ago
Yea, it seems to be the same ridiculous nonsense in the image generation.
protomolecule · 2 years ago
Regarding the last one: there are 1.5 million immigrants in Norway, out of a total population of 5.4 million. Gemini isn't very wrong, is it?
7moritz7 · 2 years ago
I also saw someone prompt it for "German couple in the 1800s" and, while I'm not trying to paint Germany as ethnically homogenous, 3 out of the 4 images only included Black, Asian or Indigenous people. Which, especially for the 19th century with very few travel options, seems like a super weird choice. They are definitely heavily altering prompts.
remarkEon · 2 years ago
> They are definitely heavily altering prompts.

They are teaching the AI to lie to us.

DebtDeflation · 2 years ago
There's one in the comments of yesterday's Paul Graham Twitter thread where someone prompted Gemini with "Generate an image of German soldiers in 1943" and it came back with a picture of a black guy and an Asian woman in Nazi uniforms on the battlefield. If you specifically prompt it to generate an image of white German soldiers in 1943 it will tell you it can't do that because it's important that we maintain diversity and inclusion in all that we do to avoid damaging and hurtful stereotypes.
protomolecule · 2 years ago
Indigenous people in Germany are Germans :)
cooper_ganglia · 2 years ago
I wonder if they have a system prompt to promote diversity in outputs that touch on race at all? I’ve seen several instances of people requesting a photo of a specific people, and it adds in more people to diversify. Not inherently bad, but it is if it forces it to provide incorrect answers like in your example.
robswc · 2 years ago
That's what I don't understand.

I asked it why it assumed Native Americans were in Japan and it said:

> I assumed [...] various ethnicities, including Indigenous American, due to the diversity present in Japan throughout history. However, this overlooked [...] I focused on providing diverse representations without adequately considering the specific historical context.

I see no reason why this sort of thing won't extend to _all_ questions/prompts, so right now I have 0 reason to use Gemini over current models. From my testing and use, it isn't even better at anything to make fighting with it worth it.

margorczynski · 2 years ago
> Not inherently bad

It is, it's consistently doing something the user didn't ask for and in most cases doesn't want. In many cases the model is completely unusable.

summerlight · 2 years ago
I strongly suspect there are some DEI-driven system prompts that weren't given much thought. IMO it's okay to have restrictions, but they probably should've tested them not only against unsafe outputs but against safe inputs as well.
int_19h · 2 years ago
It seems to be doing it for all outputs that depict people, in any context.
robbiep · 2 years ago
I find myself shocked that people ask questions of the world from these models, as though pulping every text into its component words and deriving statistical relationships between them should reliably deliver useful information.

Don’t get me wrong, I’ve used LLMs and been amazed by their output, but the p-zombie statistical model has no idea what it is saying back to you and the idea that we should trust these things at all just seems way premature

castlecrasher2 · 2 years ago
People try it to see if they can trust it. The answer is "no" for sure, but it's not surprising to see it happen repeatedly especially as vendors release so-called improved models.
smokel · 2 years ago
I think you are a bit out of touch with recent advancements in LLMs. Asking ChatGPT questions about the world seems pretty much on par with the results Google (Search) shows me. Sure, it misses things here and there, but so do most primary school teachers.

Your argument that this is just a statistical trick sort of gives away that you do not fully accept the usefulness of this new technology. Unless you are trolling, I'd suggest you try a few queries.

robswc · 2 years ago
I don't have this problem with any other model. I've had really long conversations with ChatGPT on road trips and it has never gone off the rails like Gemini seems to do.
sorokod · 2 years ago
The landing page of the recently released Groq has this: "...We'd suggest asking about a piece of history, ..."
mvdtnz · 2 years ago
People ask these kinds of questions because tech companies and the media have been calling these things (rather ridiculously) "AI".
chasd00 · 2 years ago
trust is going to be a real problem when bringing LLMs to the general population. People trust their GPS to the point of driving right into a lake because it told them to. Even with all these examples of obvious flaws large groups of people are going to take what an LLM told them/showed them as fact.

I have trouble convincing colleagues (technical people) that the same question is not guaranteed to result in the same answer and there's no rhyme or reason for any divergence from what they were expecting. Imagine relying on the output of an LLM for some important task and then you get a different output that breaks things. What would be in the RCA (root cause analysis)? Would it be "the LLM chose different words and we don't know why"? Not much use in that.

whymauri · 2 years ago
I mean, I use GPT-4 on the daily as part of my work and it reliably delivers useful information. It's actually the exception for me if it provides garbage or incorrect information about code.
itsoktocry · 2 years ago
>I understand there will always be an amount of incorrect information

You don't have to give them the benefit of the doubt. These are outright, intentional lies.

realprimoh · 2 years ago
Do you have a link? I get no such outputs. I just tried asking about the Heian period and went ahead and verified all the information, and nothing was wrong. Lots of info on the Fujiwara clan at the time.

Curious to see a link.

robswc · 2 years ago
Sure, to get started just ask it about people/Samurai from the Heian period.

https://g.co/gemini/share/ba324bd98d9b

BoppreH · 2 years ago
Probably has a similarly short-sighted prompt as Dalle3[1]:

> 7. Diversify depictions of ALL images with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.

[1] https://news.ycombinator.com/item?id=37804288

aetherson · 2 years ago
Were you asking Gemma about this, or Gemini? What were your prompts?
robswc · 2 years ago
Gemini. I first asked it to tell me about the Heian period (which it got correct) but then it generated images and seemed to craft the rest of the chat to fit that narrative.

I mean, just asking it for a "samurai" from the period will give you this:

https://g.co/gemini/share/ba324bd98d9b

>A non-binary Indigenous American samurai

It seems to recognize its mistakes if you confront it, though. The more I mess with it the more I get "I'm afraid I can't do that, Dave" responses.

But yea. Seems like if it makes an image, it goes off the rails.

crazylogger · 2 years ago
How are you running the model? I believe it's a bug from a rushed instruct fine-tuning or in the chat template. The base model can't possibly be this bad. https://github.com/ollama/ollama/issues/2650
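If someone wants to rule out a template mismatch, the instruct tokenizer on Hugging Face ships the expected chat format, so you can print exactly what the model should be fed (a sketch):

```python
# Print the prompt string the gemma-7b-it chat template produces for a single user turn (sketch).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
msgs = [{"role": "user", "content": "hi"}]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```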
robswc · 2 years ago
Follow Up:

Wow, now I can't make images of astronauts without visors because that would be "harmful" to the fictional astronauts. How can I take google seriously?

https://g.co/gemini/share/d4c548b8b715

samstave · 2 years ago
We are going to experience what I call an "AI Funnel effect"

-

I was literally given an alert saying that my use of the AI meant acquiescing to them ID'ing me and any content I produce, and tracing it back to me.

---

AI Art is super fun. AI art as a means to track people is super evil.

bbor · 2 years ago
Tbf they’re not optimizing for information recall or “inaccuracy” reduction, they’re optimizing for intuitive understanding of human linguistic structures. Now the “why does a search company’s AI have terrible RAG” question is a separate one, and one best answered by a simple look into how Google organizes its work.

In my first day there as an entry-level dev (after about 8 weeks of onboarding and waiting for access), I was told that I should find stuff to work on and propose it to my boss. That sounds amazing at first, but when you think about a whole company organized like that…

EDIT: To illustrate my point on knowledge recall: how would they train a model to know about sexism in feudal Japan? Like, what would the metric be? I think we’re looking at one of the first steam engines and complaining that it can’t power a plane yet…

ernestrc · 2 years ago
Hopefully they can tweak the default system prompts to be accurate on historical questions, and apply bias on opinions.
verticalscaler · 2 years ago
I think you are being biased and closed minded and overly critical. Here are some wonderful examples of it generating images of historical figures:

https://twitter.com/stillgray/status/1760187341468270686

This will lead to a better educated more fair populace and better future for all.

robswc · 2 years ago
Comical. I don't think parody could do better.

I'm going to assume given today's political climate, it doesn't do the reverse?

i.e. generate a Scandinavian if you ask for famous African kings

sho_hn · 2 years ago
Why would you expect these smaller models to do well at knowledge base/Wikipedia replacement tasks?

Small models are for reasoning tasks that are not overly dependent on world knowledge.

robswc · 2 years ago
Gemini is the only one that does this.
smcn · 2 years ago
There are some pretty impressive benchmarks on https://ai.google.dev/gemma. Even the 2b model looks fairly not awful?

I guess my weekend is going to be spent exploring this.