tosh · 2 years ago
Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B

  +-------------+----------+-------------+-------------+
  | Benchmark   | Gemma 7B | Mistral 7B  | Llama-2 7B  |
  +-------------+----------+-------------+-------------+
  | MMLU        |   64.3   |     60.1    |     45.3    |
  | HellaSwag   |   81.2   |     81.3    |     77.2    |
  | HumanEval   |   32.3   |     30.5    |     12.8    |
  +-------------+----------+-------------+-------------+
via https://mistral.ai/news/announcing-mistral-7b/

sa-code · 2 years ago
Thank you. I thought it was weird for them to release a 7B model and not mention Mistral in their release.
mochomocha · 2 years ago
The technical report (linked in the 2nd paragraph of the blog post) mentions it, and compares against it: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...
nl · 2 years ago
The release page has comparisons to Mistral everywhere: https://ai.google.dev/gemma

mirekrusin · 2 years ago
They forgot.

Also phi-2.

brucethemoose2 · 2 years ago
Only 8K context as well, like Mistral.

Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.

DreamGen · 2 years ago
Mistral Instruct v0.2 is 32K.
tosh · 2 years ago
Agree: it will be interesting to see how Gemma does on Chatbot Arena
Kydlaw · 2 years ago
They state in their report that they filter evaluation data out of their training data, see p.3 - Filtering:

"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive outputs."

YetAnotherNick · 2 years ago
According to their paper, the average over the standard tasks is 54.0 for Mistral and 56.4 for Gemma, so about 4.4% better in relative terms. Not as big a gap as you would expect from the company that invented transformers and probably has 2-3 orders of magnitude more compute for training, versus a few-month-old French startup.

Also of note from their human evaluations: Gemma 7B IT has a 51.7% win rate against Mistral 7B v0.2 Instruct.

jcuenod · 2 years ago
Came here to post the same thing for Phi-2:

  +-------------+----------+-------------+
  | Benchmark   | Gemma 2B | Phi-2 2.7B  |
  +-------------+----------+-------------+
  | MMLU        |   42.3   |     56.7    |
  | MBPP        |   29.2   |     59.1    |
  | BoolQ       |   69.4   |     83.3    |
  +-------------+----------+-------------+

[0] https://www.kaggle.com/models/google/gemma

[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...

rfw300 · 2 years ago
A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.
daemonologist · 2 years ago
Really looking forward to the day someone puts out an open model which outperforms Flan-T5 on BoolQ.
FergusArgyll · 2 years ago
The real gold will be when this gets finetuned (maybe by Mistral...)
brucethemoose2 · 2 years ago
TBH the community has largely outrun Mistral's own finetuning. The 7B model in particular is such a popular target because it's so practical to train.
itomatik · 2 years ago
how does one finetune llama (or any other LLM) using mistral?

is the flow like this?

- take a small dataset

- generate a bigger dataset using mistral (how is this done?)

- run LoRA to fine-tune gemma on the extended dataset (rough sketch below)
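For the last step, a minimal sketch of what LoRA fine-tuning could look like with Hugging Face transformers + peft (the model id, file name, and hyperparameters here are placeholders, not a recommended recipe):

```python
# LoRA fine-tuning sketch (assumes transformers, peft, datasets, accelerate are installed).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-7b"  # placeholder; any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Train small adapter matrices instead of updating all 7B weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Hypothetical JSONL file with a "text" field holding the Mistral-generated prompt+response pairs.
dataset = load_dataset("json", data_files="synthetic_pairs.jsonl", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```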

heckl239u · 2 years ago
Some test vids came out on the 7B model: https://www.youtube.com/watch?v=1Mn0U6HGLeg Shocker, it doesn't perform well at all.
attentive · 2 years ago
In my subjective tests it's not even close to Mistral. My local Gemma is quantized, but so is Mistral.

But I also tried gemma on huggingface.co/chat which I assume isn't quantized.

lawxls · 2 years ago
Honestly, this is more of a PR stunt to advertise the Google Dev ecosystem than a contribution to open-source. I'm not complaining, just calling it what it is.

Barely an improvement over the 5-month-old Mistral model, with the same context length of 8k. And this is a release after their announcement of Gemini Pro 1.5, which had an exponential increase in context length.

scarmig · 2 years ago
Who cares if it's a PR stunt to improve developer good will? It's still a good thing, and it's now the most open model out there.
crossroadsguy · 2 years ago
That’s about the point of having a developer ecosystem, isn’t it?
kiraaa · 2 years ago
mistral 7b v0.2 supports 32k
simonw · 2 years ago
The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy

Something that caught my eye in the terms:

> Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.

One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.

a2128 · 2 years ago
This is actually not that unusual. Stable Diffusion's license, CreativeML Open RAIL-M, has the exact same clause: "You shall undertake reasonable efforts to use the latest version of the Model."

Obviously updating the model is not very practical when you're using finetuned versions, and people still use old versions of Stable Diffusion. But it does make me fear the possibility that if they ever want to "revoke" everybody's license to use the model, all they have to do is just post a model update that's functionally useless for anything and go after anyone still using the old versions that actually do anything.

slowmovintarget · 2 years ago
So if they wish to apply censorship they forgot about, or suddenly discovered a reason for, they want you to be obligated to take it.

Good faith possibilities: Copyright liability requires retraining, or altering the underlying training set.

Gray area: "Safety" concerns where the model recommends criminal behavior (see uncensored GPT 4 evaluations).

Bad faith: Censorship or extra weighting added based on political agenda or for-pay skewing of results.

iandanforth · 2 years ago
These are all very new licenses that deviate from OSI principles, I think it's fair to call them "unusual".
simonw · 2 years ago
That's useful context, thanks - I hadn't realized this clause was already out there for other models.
wongarsu · 2 years ago
I don't think a broken model would trigger that clause in a meaningful way, because then you simply can't update with reasonable effort. You would be obliged to try the new model in a test environment, and as soon as you notice it doesn't perform and making it perform would require unreasonable effort you can simply stay on the old version.

However you might be required to update if they do more subtle changes, like a new version that only speaks positively about Google and only negatively about Microsoft. Provided this doesn't have an obvious adverse impact on your use of the model.

ummonk · 2 years ago
Switching to a model that is functionally useless doesn't seem to fall under "reasonable efforts" to me, but IANAL.
Silphendio · 2 years ago
It's worth noting that Stable Diffusion XL uses the OpenRAIL++-M License, which removed the update obligation.
jacooper · 2 years ago
Why the hell do they use such a crappy license in the first place?
tgtweak · 2 years ago
I don't think there's a way they can enforce that reasonably. There's no connection to the mothership to report back what version is being used or license keys at runtime...

Seems more like a "if we discover something unsafe you should update your model and we aren't liable if you don't" than something that would make your model stop working.

summerlight · 2 years ago
This kind of defensive statement in a ToS is usually due to obscure regulations or leading cases, and model developers need a way to limit liability. There's no practical way to enforce this, but they can claim that when bad things happen it's purely on model users rather than model developers.
pram · 2 years ago
They have to make sure you’re receiving the most cutting edge chiding lectures when you make naughty and problematic requests.
astrange · 2 years ago
You can't make a local model do that: e.g. you can force the answer to begin with "Yes" or use control vectors so it agrees with you.
xyzzyz · 2 years ago
This is strangely reminiscent of the Soviet Union, where after they got rid of Lavrentiy Beria, they mailed the update to subscribers of the Great Soviet Encyclopedia, where they asked to remove the three pages with Beria’s biography and replace them with the three provided pages.
legohead · 2 years ago
Sounds like it's "reasonable" for you not to update then.
wahnfrieden · 2 years ago
It says you must make efforts (to a reasonable extent), not that you must give a reason for not making efforts
maronato · 2 years ago
This sounds like a clause to cover themselves in case older versions have any serious issues
catchnear4321 · 2 years ago
reasonable effort - meaning if their changes meaningfully impact my usage, negatively, it would be unreasonable to ask me to upgrade.

sounds good.

this is not financial advice and ianal.

res0nat0r · 2 years ago
Isn't this just lawyer speak for "we update our model a lot, and we've never signed off on saying we're going to support every previous release we've ever published, and may turn them off at any time, don't complain about it when we do."
4bpp · 2 years ago
Ugh, I would fully expect this kind of clause to start popping up in other software ToSes soon if it hasn't already. Contractually mandatory automatic updates.
bsimpson · 2 years ago
I appreciated this post clarifying the distinction between "open model" and "open source":

https://opensource.googleblog.com/2024/02/building-open-mode...

I'm not sure how to feel about the restrictions. "No porn" feels prudish, particularly for this millennium. I tend to err on the side of freedom in intellectual/political matters; however, the others seem fairly reasonable as far as restrictions go.

phillipcarter · 2 years ago
Huh. I wonder why that is a part of the terms. I feel like that's more of a support concern.

silverliver · 2 years ago
You don't have to agree to this policy to use the model.
samstave · 2 years ago
model watermarking? does this exist?

alekandreev · 2 years ago
Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models.

Opinions are our own and not of Google DeepMind.

voxgen · 2 years ago
Thank you very much for releasing these models! It's great to see Google enter the battle with a strong hand.

I'm wondering if you're able to provide any insight into the below hyperparameter decisions in Gemma's architecture, as they differ significantly from what we've seen with other recent models?

* On the 7B model, the `d_model` (3072) is smaller than `num_heads * d_head` (16*256=4096). I don't know of any other model where these numbers don't match.

* The FFN expansion factor of 16x is MUCH higher than the Llama-2-7B's 5.4x, which itself was chosen to be equi-FLOPS with PaLM's 4x.

* The vocab is much larger - 256k, where most small models use 32k-64k.

* GQA is only used on the 2B model, where we've seen other models prefer to save it for larger models.

These observations are in no way meant to be criticism - I understand that Llama's hyperparameters are also somewhat arbitrarily inherited from its predecessors like PaLM and GPT-2, and that it's non-trivial to run hyperopt on such large models. I'm just really curious about what findings motivated these choices.
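(For what it's worth, the first point works mechanically because the attention output projection maps num_heads * d_head back down to d_model. A toy shape check, not Gemma's actual code; the dimensions are the 7B ones mentioned above:)

```python
# Toy shape check: attention where num_heads * d_head != d_model (PyTorch).
import torch
import torch.nn as nn

d_model, num_heads, d_head, seq = 3072, 16, 256, 8  # Gemma-7B-like dimensions
inner = num_heads * d_head                          # 4096, wider than d_model

q_proj = nn.Linear(d_model, inner, bias=False)
k_proj = nn.Linear(d_model, inner, bias=False)
v_proj = nn.Linear(d_model, inner, bias=False)
o_proj = nn.Linear(inner, d_model, bias=False)      # maps 4096 back down to 3072

x = torch.randn(1, seq, d_model)
q = q_proj(x).view(1, seq, num_heads, d_head).transpose(1, 2)
k = k_proj(x).view(1, seq, num_heads, d_head).transpose(1, 2)
v = v_proj(x).view(1, seq, num_heads, d_head).transpose(1, 2)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1) @ v
out = o_proj(attn.transpose(1, 2).reshape(1, seq, inner))
print(out.shape)  # torch.Size([1, 8, 3072]): the residual stream width is unchanged
```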

owl_brawl · 2 years ago
I would love answers to these questions too, particularly on the vocab size
lordswork · 2 years ago
Is there any truth behind this claim that folks who worked on Gemma have left Google?

https://x.com/yar_vol/status/1760314018575634842

lordswork · 2 years ago
I confirmed all the folks listed on page 12 are still at Google (listed below). I am guessing the linked tweet is a BS claim.

   # Product Management
   Tris Warkentin
   Ludovic Peran

   # Program Management
   Minh Giang

   # Executive Sponsors
   Clement Farabet
   Oriol Vinyals
   Jeff Dean
   Koray Kavukcuoglu
   Demis Hassabis
   Zoubin Ghahramani
   Douglas Eck
   Joelle Barral
   Fernando Pereira
   Eli Collins

   # Leads
   Armand Joulin
   Noah Fiedel
   Evan Senter

   # Tech Leads
   Alek Andreev†
   Kathleen Kenealy†

elcomet · 2 years ago
It seems very easy to check, no? Look at the names in the paper and check where they are working now.

CaffeinatedDev · 2 years ago
Them: here to answer questions

Question

Them: :O

bluefinity · 2 years ago
To be fair, the tweet says that they don't work on the models at Google anymore, not that they have left Google.

Might be true, might not be. It's unsourced speculation.

LorenDB · 2 years ago
EDIT: it seems this is likely an Ollama bug, please keep that in mind for the rest of this comment :)

I ran Gemma in Ollama and noticed two things. First, it is slow: Gemma got less than 40 tok/s while Llama 2 7B got over 80 tok/s. Second, it is very bad at output generation. I said "hi", and it responded with this:

```
Hi, . What is up? melizing with you today!

What would you like to talk about or hear from me on this fine day??
```

With longer and more complex prompts it goes completely off the rails. Here's a snippet from its response to "Explain how to use Qt to get the current IP from https://icanhazip.com":

```python
print( "Error consonming IP arrangration at [local machine's hostname]. Please try fufing this function later!") ## guanomment messages are typically displayed using QtWidgets.MessageBox
```

Do you see similar results on your end or is this just a bug in Ollama? I have a terrible suspicion that this might be a completely flawed model, but I'm holding out hope that Ollama just has a bug somewhere.

mark_l_watson · 2 years ago
I was going to try these models with Ollama. Did you use a small number of bits/quantization?
fosterfriends · 2 years ago
Not a question, but thank you for your hard work! Also, brave of you to join the HN comments, I appreciate your openness. Hope y'all get to celebrate the launch :)
lnyan · 2 years ago
Will there be Gemma-vision models or multimodal Gemma models?
alekandreev · 2 years ago
We have many exciting things planned that we can't reveal just yet :)
Jayakumark · 2 years ago
Have the same question.
h1t35h · 2 years ago
It seems you have exposed the internal debugging tool link in the blog post. You may want to do something about it.
trisfromgoogle · 2 years ago
Ah, I see -- the link is wrong, thank you for flagging! Fixing now.
pama · 2 years ago
Will these soon be available on lmsys for human comparison against other models? Can they run with llama.cpp?
sbarre · 2 years ago
Can the Gemma models be downloaded to run locally, like open-source models Llama2, Mistral, etc ?

Or is your definition of "open" different?

austinvhuang · 2 years ago
Yes, models can be downloaded locally. In addition to the Python NN frameworks and ggml as options, we also implemented a standalone C++ implementation that you can run locally at https://github.com/google/gemma.cpp
kathleenfromgdm · 2 years ago
Yes, you can get started downloading the model and running inference on Kaggle: https://www.kaggle.com/models/google/gemma ; for a full list of ways to interact with the model, you can check out https://ai.google.dev/gemma.
Kostic · 2 years ago
It should be possible to run it via llama.cpp[0] now.

[0] https://github.com/ggerganov/llama.cpp/pull/5631

mrob · 2 years ago
Mistral weights are released under an Apache 2.0 license, but Llama 2 weights are released under a proprietary license that prohibits use by large organizations and imposes usage restrictions, violating terms 5 and 6 of the Open Source Definition[0]. Even if you accept that a model with a proprietary training dataset and proprietary training code can be considered "open source", there's no way Llama 2 qualifies.

For consistency with existing definitions[1], Llama 2 should be labeled a "weights available" model.

[0] https://en.wikipedia.org/wiki/The_Open_Source_Definition

[1] https://en.wikipedia.org/wiki/Source-available_software

tomp · 2 years ago
Their definition of "open" is "not open", i.e. you're only allowed to use Gemma in a "non-harmful" way.

We all know that Google thinks that saying that 1800s English kings were white is "harmful".

neximo64 · 2 years ago
How are these performing so well compared to Llama 2? Are there any documents on the architecture and differences? Is it MoE?

Also note some of the links on the blog post don't work, e.g debugging tool.

kathleenfromgdm · 2 years ago
We've documented the architecture (including key differences) in our technical report here (https://goo.gle/GemmaReport), and you can see the architecture implementation in our Git Repo (https://github.com/google-deepmind/gemma).
declaredapple · 2 years ago
Congrats on the launch and thanks for the contribution! This looks like it's on par with or better than Mistral 7B 0.1, or is that 0.2?

Are there plans for MoE or 70B models?

kathleenfromgdm · 2 years ago
Great question - we compare to the Mistral 7B 0.1 pretrained models (since there were no pretrained checkpoint updates in 0.2) and the Mistral 7B 0.2 instruction-tuned models in the technical report here: https://goo.gle/GemmaReport
audessuscest · 2 years ago
Does this model also think Germans were black 200 years ago? Or is it afraid to answer basic stuff? Because if that's the case, no one will care about this model.
graphe · 2 years ago
I disagree, coding and RAG performance is all that matters to me. I'm not using an LLM to learn basic facts I already know.
freedomben · 2 years ago
I don't know anything about these twitter accounts so I don't know how credible they are, but here are some examples for your downvoters, who I'm guessing think you're just trolling or grossly exaggerating:

https://twitter.com/aginnt/status/1760159436323123632

https://twitter.com/Black_Pilled/status/1760198299443966382

zitterbewegung · 2 years ago
Do you have a plan of releasing higher parameter models?
alekandreev · 2 years ago
We have many great things in research and development phases, so stay tuned. I'm hopeful we can share more in the coming weeks and months!
memossy · 2 years ago
Training on 4096 TPUv5es, how did you handle the crazy batch size? :o
tosh · 2 years ago
Are there any plans for releasing the datasets used?
alekandreev · 2 years ago
This would be really interesting in my opinion, but we are not releasing datasets at this time. See the C4 dataset for an earlier open dataset from Google.
CuriouslyC · 2 years ago
It's cool that you guys are able to release open stuff, that must be a nice change from the modus operandi at Goog. I'll have to double check, but it looks like Phi-2 beats your performance in some cases while being smaller. I'm guessing the value proposition of these models is being small and good while also having more knowledge baked in?
alekandreev · 2 years ago
We deeply respect the Phi team and all other teams in the open model space. You’ll find that different models have different strengths and not all can be quantified with existing public evals. Take them for a spin and see what works for you.
owl_brawl · 2 years ago
Hi alekandreev,

Any reason you decided to go with a token vocabulary size of 256k? Smaller vocab sizes, like the ~16-32k most models of this size seem to use, are much easier to work with. Would love to understand the technical reasoning here; it isn't detailed in the report, unfortunately :(.

moffkalast · 2 years ago
I'm not sure if this was mentioned in the paper somewhere, but how much does the super large 256k tokenizer vocabulary influence inference speed, and how much higher is the average text compression compared to Llama's usual 32k? In short, is it really worth going beyond GPT-4's 100k?
nuclearjam · 2 years ago
May I ask what the RAM requirement is for running the 2B model on CPU on an average consumer Windows laptop? I have 16 GB of RAM but I am seeing a CPU/memory traceback. I'm using the transformers implementation.
dmnsl · 2 years ago
Hi, what is the cutoff date?
alekandreev · 2 years ago
September 2023.
legohead · 2 years ago
All it will tell me is mid-2018.
jmorgan · 2 years ago
Hi! This is such an exciting release. Congratulations!

I work on Ollama and used the provided GGUF files to quantize the model. As mentioned by a few people here, the 4-bit integer quantized models (which Ollama defaults to) seem to have strange output with non-existent words and funny use of whitespace.

Do you have a link/reference as to how the models were converted to GGUF format? And is it expected that quantizing the models might cause this issue?

Thanks so much!

espadrine · 2 years ago
As a data point, using the Huggingface Transformers 4-bit quantization yields reasonable results: https://twitter.com/espadrine/status/1760355758309298421
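Something along these lines (a sketch, assuming bitsandbytes is installed; the prompt and generation settings are just illustrative):

```python
# 4-bit NF4 quantized load via transformers + bitsandbytes (sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it",
                                             quantization_config=bnb,
                                             device_map="auto")

inputs = tok("Explain what 4-bit quantization does to a language model.",
             return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
```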
kleiba · 2 years ago
> We are really excited to answer any questions you may have about our models.

I cannot count how many times I've seen similar posts on HN, followed by tens of questions from other users, three of which actually get answered by the OP. This one seems to be no exception so far.

alekandreev · 2 years ago
Sorry, doing our best here :)
spankalee · 2 years ago
What are you talking about? The team is in this thread answering questions.
vorticalbox · 2 years ago
are there plans to release an official GGUF version to use with llama.cpp?
espadrine · 2 years ago
It is already part of the release on Huggingface: https://huggingface.co/google/gemma-7b/blob/main/gemma-7b.gg...

It is a pretty clean release! I had some 500 issues with Kaggle validating my license approval, so you might too, but after a few attempts I could access the model.

quickgist · 2 years ago
Will this be available as a Vertex AI foundational model like Gemini 1.0, without deploying a custom endpoint? Any info on pricing? (Also, when will Gemini 1.5 be available on Vertex?)
turnsout · 2 years ago
What is the license? I couldn’t find it on the 1P site or Kaggle.
trisfromgoogle · 2 years ago
You can find the terms on our website, ai.google.dev/gemma:

https://ai.google.dev/gemma/terms

sqreept · 2 years ago
What are the supported languages of these models?
alekandreev · 2 years ago
This v1 model is focused on English support, but you may find some multilingual capabilities.
cypress66 · 2 years ago
Can you share the training loss curve?
brucethemoose2 · 2 years ago
Will there be "extended context" releases like 01.ai did for Yi?

Also, is the model GQA?

hustwindmaple1 · 2 years ago
It's MQA, documented in the tech report

artninja1988 · 2 years ago
I find the snide remarks around open source in the paper and announcement rather off-putting.

As the ecosystem evolves, we urge the corporate AI community to move beyond demanding to be taken seriously as a player in open source for models that are not actually open, and avoid preaching with a PR statement that can be interpreted as uninformed at best or malicious at worst.

trisfromgoogle · 2 years ago
It would be great to understand what you mean by this -- we have a deep love for open source and the open developer ecosystem. Our open source team also released a blog today describing the rationale and approach for open models and continuing AI releases in the open ecosystem:

https://opensource.googleblog.com/2024/02/building-open-mode...

Thoughts and feedback welcome, as always.

jppittma · 2 years ago
Working at google is like this, where no matter how much you try to do the right thing you're always under attack.
silentsanctuary · 2 years ago
Which remarks are you referring to?
espadrine · 2 years ago
I notice a few divergences to common models:

- The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x;

- The vocabulary size is 10x (256K vs. Mistral’s 32K);

- The training token count is tripled (6T vs. Llama2's 2T)

Apart from that, it uses the classic transformer variations: MQA, RoPE, RMSNorm.

How big was the batch size that it could be trained so fast?

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/bl...
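For anyone who wants to double-check these numbers, the published configs expose most of them directly (a sketch; both repos are gated behind license acceptance on Hugging Face):

```python
# Print the relevant hyperparameters straight from the published configs (sketch).
from transformers import AutoConfig

for name in ["google/gemma-7b", "mistralai/Mistral-7B-Instruct-v0.2"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.hidden_size, cfg.intermediate_size, cfg.vocab_size,
          cfg.num_attention_heads, cfg.num_key_value_heads)
```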

andy_xor_andrew · 2 years ago
> The training token count is tripled (6T vs. Llama2's 2T)

Damn, 6T? That's a lot!

Given that this model seems to roughly match Mistral (according to the numbers from Google), this makes me think we have saturated the 7B parameter space, and couldn't possibly make it much better unless new techniques are discovered.

espadrine · 2 years ago
Hard to say definitively. Mistral’s token embeddings only account for <2% of the 7B parameters, while Gemma’s larger token vocabulary vampirized over 10%, leaving less space for the more important parts of the network. It is a somewhat surprising tradeoff given that it was pretrained towards an English bias.
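Back-of-the-envelope, using the nominal 7B figure rather than the exact parameter counts:

```python
# Rough share of parameters spent on the token embedding matrix (vocab * d_model / total).
def embed_share(vocab, d_model, total=7e9):
    return vocab * d_model / total

print(f"Mistral 7B: {embed_share(32_000, 4096):.1%}")   # ~1.9%
print(f"Gemma 7B:   {embed_share(256_000, 3072):.1%}")  # ~11.2%
```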
GaggiX · 2 years ago
Looking at the config.json of Gemma 7B, the feedforward hidden size is 8x, not 16x.
espadrine · 2 years ago
Huh, indeed, that's what the config.json[0] says; the report[1] indicates “Feedforward hidden dims: 49152”.

[0]:https://huggingface.co/google/gemma-7b-it/blob/main/config.j...

[1]: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...

lalaithion · 2 years ago
What does tokenization look like in 256k vs 32k?
espadrine · 2 years ago
It mostly means that there are tokens dedicated to rarer sequences of characters, even in foreign languages (note that Gemma is not intended to be good multilingually): “説明書” (instruction manual) has its own token, and so does “Nixon”, “آباد” (a city suffix, I believe), and the HTML sequence "\"><!--".
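A quick way to see the practical effect is to count how many tokens each tokenizer needs for the same string (a sketch, assuming both tokenizers can be fetched from Hugging Face):

```python
# Compare token counts for the same mixed-language text across tokenizers (sketch).
from transformers import AutoTokenizer

text = "説明書を読んでから、Nixonに手紙を書いた。"  # arbitrary mixed Japanese/English sample
for name in ["google/gemma-7b", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok.encode(text, add_special_tokens=False)))
```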
visarga · 2 years ago
Text encodes in fewer tokens, and language coverage is better.
margorczynski · 2 years ago
Is there a chance we'll get a model without the "alignment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine tuning.
kathleenfromgdm · 2 years ago
We release our non-aligned models (marked as pretrained or PT models across platforms) alongside our fine-tuned checkpoints; for example, here is our pretrained 7B checkpoint for download: https://www.kaggle.com/models/google/gemma/frameworks/keras/...
brucethemoose2 · 2 years ago
Alignment is all but a non-issue with open-weight base model releases, as they can be finetuned to "de-align" them if prompt engineering is not enough.

yakorevivan · 2 years ago
They have released finetuning code too. You can finetune it to remove the alignment finetuning. I believe it would take just a few hours at max and a couple of dollars.
politician · 2 years ago
More useful would be a precise characterization of the type and balance of the ideological fine tuning.

They include performance benchmarks. End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?

ben_w · 2 years ago
> End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?

Can you define that in a way that's actually testable? I can't, and I've been thinking about "unthinkable thoughts" for quite some time now: https://kitsunesoftware.wordpress.com/2018/06/26/unlearnable...

FergusArgyll · 2 years ago
You can (and someone will) fine-tune it away. There are FOSS datasets on Hugging Face you can use.

Or you can just wait, it'll be done soon...

declaredapple · 2 years ago
You can but it'll never be the same as the base model.

That said it appears they also released the base checkpoints that aren't fine-tuned for alignment

joshelgar · 2 years ago
Could you give an example of these datasets?
mustafabisic1 · 2 years ago
The fact that the Gemma team is in the comments section answering questions is praiseworthy to me :)
carom · 2 years ago
I've worked at Google. It is the organization with the highest concentration of engineering talent I've ever been at. Almost to the point that it is ridiculous, because you have extremely good engineers working on internal reporting systems for middle managers.
pphysch · 2 years ago
Why is this anonymous tweet with no evidence or engagement being posted by multiple users in this thread? Why not just make the same claim directly?
callalex · 2 years ago
The link is broken. On HN (or any forum really) it is expected for a brief description of the content to be provided when posting a link. Links die all the time, but forum posts don’t have to die with them.
robswc · 2 years ago
I personally can't take any models from google seriously.

I was asking it about the Japanese Heian period and it told me such nonsensical information you would have thought it was a joke or parody.

Some highlights were "Native American women warriors rode across the grassy plains of Japan, carrying Yumi" and "A diverse group of warriors, including a woman of European descent wielding a katana, stand together in camaraderie, showcasing the early integration of various ethnicities in Japanese society"

Stuff like that is so obviously incorrect. How am I supposed to trust it on topics where such ridiculous inaccuracies aren't so obvious to me?

I understand there will always be an amount of incorrect information... but I've never seen something this bad. Llama performed so much better.

ramoz · 2 years ago
I was wondering if these models would perform in such a way, given this week's X/twitter storm over Gemini generated images.

E.g.

https://x.com/debarghya_das/status/1759786243519615169?s=20

https://x.com/MiceynComplex/status/1759833997688107301?s=20

https://x.com/AravSrinivas/status/1759826471655452984?s=20

charcircuit · 2 years ago
Those are most likely due to the system prompt, which tries to reduce bias (but ends up introducing bias in the opposite direction for some prompts, as you can see), so I wouldn't expect to see that happen with an open model where you can control the entire system prompt.
epistasis · 2 years ago
Of all the very very very many things that Google models get wrong, not understanding nationality and skin tone distributions seems to be a very weird one to focus on.

Why are there three links to this question? And why are people so upset over it? Very odd, seems like it is mostly driven by political rage.

robswc · 2 years ago
Yea, it seems to be the same ridiculous nonsense in the image generation.
protomolecule · 2 years ago
Regarding the last one: there are 1.5 million immigrants in Norway, out of a total population of 5.4 million. Gemini isn't very wrong, is it?
7moritz7 · 2 years ago
I also saw someone prompt it for "German couple in the 1800s" and, while I'm not trying to paint Germany as ethnically homogenous, 3 out of the 4 images only included Black, Asian or Indigenous people. Which, especially for the 19th century with very few travel options, seems like a super weird choice. They are definitely heavily altering prompts.
remarkEon · 2 years ago
> They are definitely heavily altering prompts.

They are teaching the AI to lie to us.

DebtDeflation · 2 years ago
There's one in the comments of yesterday's Paul Graham Twitter thread where someone prompted Gemini with "Generate an image of German soldiers in 1943" and it came back with a picture of a black guy and an Asian woman in Nazi uniforms on the battlefield. If you specifically prompt it to generate an image of white German soldiers in 1943 it will tell you it can't do that because it's important that we maintain diversity and inclusion in all that we do to avoid damaging and hurtful stereotypes.
protomolecule · 2 years ago
Indigenous people in Germany are Germans :)
cooper_ganglia · 2 years ago
I wonder if they have a system prompt to promote diversity in outputs that touch on race at all? I’ve seen several instances of people requesting a photo of a specific people, and it adds in more people to diversify. Not inherently bad, but it is if it forces it to provide incorrect answers like in your example.
robswc · 2 years ago
That's what I don't understand.

I asked it why it assumed Native Americans were in Japan and it said:

> I assumed [...] various ethnicities, including Indigenous American, due to the diversity present in Japan throughout history. However, this overlooked [...] I focused on providing diverse representations without adequately considering the specific historical context.

I see no reason why this sort of thing won't extend to _all_ questions/prompts, so right now I have 0 reason to use Gemini over current models. From my testing and use, it isn't even better at anything to make fighting with it worth it.

margorczynski · 2 years ago
> Not inherently bad

It is, it's consistently doing something the user didn't ask for and in most cases doesn't want. In many cases the model is completely unusable.

summerlight · 2 years ago
I strongly suspect there are some DEI-driven system prompts that weren't given much thought. IMO it's okay to have restrictions, but they probably should've tested them not only against unsafe outputs but against safe inputs as well.
int_19h · 2 years ago
It seems to be doing it for all outputs that depict people, in any context.
robbiep · 2 years ago
I find myself shocked that people ask questions of the world from these models, as though pulping every text into its component words and deriving statistical relationships between them should reliably deliver useful information.

Don’t get me wrong, I’ve used LLMs and been amazed by their output, but the p-zombie statistical model has no idea what it is saying back to you and the idea that we should trust these things at all just seems way premature

castlecrasher2 · 2 years ago
People try it to see if they can trust it. The answer is "no" for sure, but it's not surprising to see it happen repeatedly especially as vendors release so-called improved models.
smokel · 2 years ago
I think you are a bit out of touch with recent advancements in LLMs. Asking ChatGPT questions about the world seems pretty much on par with the results Google (Search) shows me. Sure, it misses things here and there, but so do most primary school teachers.

Your argument that this is just a statistical trick sort of gives away that you do not fully accept the usefulness of this new technology. Unless you are trolling, I'd suggest you try a few queries.

robswc · 2 years ago
I don't have this problem with any other model. I've had really long conversations with ChatGPT on road trips and it has never gone off the rails like Gemini seems to do.
sorokod · 2 years ago
The landing page of the recently released Groq has this: "...We'd suggest asking about a piece of history, ..."
mvdtnz · 2 years ago
People ask these kinds of questions because tech companies and the media have been calling these things (rather ridiculously) "AI".
chasd00 · 2 years ago
trust is going to be a real problem when bringing LLMs to the general population. People trust their GPS to the point of driving right into a lake because it told them to. Even with all these examples of obvious flaws large groups of people are going to take what an LLM told them/showed them as fact.

I have trouble convincing colleagues (technical people) that the same question is not guaranteed to result in the same answer and there's no rhyme or reason for any divergence from what they were expecting. Imagine relying on the output of an LLM for some important task and then you get a different output that breaks things. What would be in the RCA (root cause analysis)? Would it be "the LLM chose different words and we don't know why"? Not much use in that.

whymauri · 2 years ago
I mean, I use GPT-4 on the daily as part of my work and it reliably delivers useful information. It's actually the exception for me if it provides garbage or incorrect information about code.
itsoktocry · 2 years ago
>I understand there will always be an amount of incorrect information

You don't have to give them the benefit of the doubt. These are outright, intentional lies.

realprimoh · 2 years ago
Do you have a link? I get no such outputs. I just tried asking about the Heian period and went ahead and verified all the information, and nothing was wrong. Lots of info on the Fujiwara clan at the time.

Curious to see a link.

robswc · 2 years ago
Sure, to get started just ask it about people/Samurai from the Heian period.

https://g.co/gemini/share/ba324bd98d9b

BoppreH · 2 years ago
Probably has a similarly short-sighted prompt as Dalle3[1]:

> 7. Diversify depictions of ALL images with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.

[1] https://news.ycombinator.com/item?id=37804288

aetherson · 2 years ago
Were you asking Gemma about this, or Gemini? What were your prompts?
robswc · 2 years ago
Gemini. I first asked it to tell me about the Heian period (which it got correct) but then it generated images and seemed to craft the rest of the chat to fit that narrative.

I mean, just asking it for a "samurai" from the period will give you this:

https://g.co/gemini/share/ba324bd98d9b

>A non-binary Indigenous American samurai

It seems to recognize its mistakes if you confront it, though. The more I mess with it the more I get "I'm afraid I can't do that, Dave" responses.

But yea. Seems like if it makes an image, it goes off the rails.

crazylogger · 2 years ago
How are you running the model? I believe it's a bug from a rushed instruct fine-tuning or in the chat template. The base model can't possibly be this bad. https://github.com/ollama/ollama/issues/2650
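If someone wants to rule out a template mismatch, the instruct tokenizer on Hugging Face ships the expected chat format, so you can print exactly what the model should be fed (a sketch):

```python
# Print the prompt string the gemma-7b-it chat template produces for a single user turn (sketch).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
msgs = [{"role": "user", "content": "hi"}]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```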
robswc · 2 years ago
Follow Up:

Wow, now I can't make images of astronauts without visors because that would be "harmful" to the fictional astronauts. How can I take google seriously?

https://g.co/gemini/share/d4c548b8b715

samstave · 2 years ago
We are going to experience what I call an "AI Funnel effect"

-

I was literally given an alert saying that my use of the AI meant acquiescing to them ID'ing me and any content I produce, and tracing it back to me.

---

AI Art is super fun. AI art as a means to track people is super evil.

bbor · 2 years ago
Tbf they’re not optimizing for information recall or “inaccuracy” reduction, they’re optimizing for intuitive understanding of human linguistic structures. Now the “why does a search company’s AI have terrible RAG” question is a separate one, and one best answered by a simple look into how Google organizes its work.

In my first day there as an entry-level dev (after about 8 weeks of onboarding and waiting for access), I was told that I should find stuff to work on and propose it to my boss. That sounds amazing at first, but when you think about a whole company organized like that…

EDIT: To illustrate my point on knowledge recall: how would they train a model to know about sexism in feudal Japan? Like, what would the metric be? I think we’re looking at one of the first steam engines and complaining that it can’t power a plane yet…

ernestrc · 2 years ago
Hopefully they can tweak the default system prompts to be accurate on historical questions, and apply bias on opinions.
verticalscaler · 2 years ago
I think you are being biased and closed minded and overly critical. Here are some wonderful examples of it generating images of historical figures:

https://twitter.com/stillgray/status/1760187341468270686

This will lead to a better educated more fair populace and better future for all.

robswc · 2 years ago
Comical. I don't think parody could do better.

I'm going to assume given today's political climate, it doesn't do the reverse?

i.e. generate a Scandinavian if you ask for famous African kings

sho_hn · 2 years ago
Why would you expect these smaller models to do well at knowledge base/Wikipedia replacement tasks?

Small models are for reasoning tasks that are not overly dependent on world knowledge.

robswc · 2 years ago
Gemini is the only one that does this.
smcn · 2 years ago
There are some pretty impressive benchmarks on https://ai.google.dev/gemma. Even the 2b model looks fairly not awful?

I guess my weekend is going to be spent exploring this.