They state in their report that they filter evaluation data out of their training data; see p. 3, "Filtering":
"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive
outputs."
According to their paper, the average over standard tasks is 54.0 for Mistral and 56.4 for Gemma, i.e. about 4.4% better in relative terms. Not as big a gap as you would expect from the company that invented the transformer and probably has 2-3 orders of magnitude more compute for training, versus a few-months-old French startup.
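A quick sanity check of that relative difference, using only the two averages quoted above (a minimal sketch):

```python
# Averages over the "standard tasks" as quoted in the comment above.
mistral_avg = 54.0
gemma_avg = 56.4

relative_gain = (gemma_avg - mistral_avg) / mistral_avg
print(f"{relative_gain:.1%}")  # -> 4.4%
```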
Also of note on their human evaluations: Gemma 7B IT has a 51.7% win rate against Mistral 7B Instruct v0.2.
A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.
Honestly, this is more of a PR stunt to advertise the Google Dev ecosystem than a contribution to open-source. I'm not complaining, just calling it what it is.
Barely an improvement over the 5-month-old Mistral model, with the same context length of 8k. And this is a release after their announcement of Gemini Pro 1.5, which had an exponential increase in context length.
Something that caught my eye in the terms:
> Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.
One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.
This is actually not that unusual. Stable Diffusion's license, CreativeML Open RAIL-M, has the exact same clause: "You shall undertake reasonable efforts to use the latest version of the Model."
Obviously updating the model is not very practical when you're using finetuned versions, and people still use old versions of Stable Diffusion. But it does make me fear the possibility that if they ever want to "revoke" everybody's license to use the model, all they have to do is just post a model update that's functionally useless for anything and go after anyone still using the old versions that actually do anything.
I don't think a broken model would trigger that clause in a meaningful way, because then you simply can't update with reasonable effort. You would be obliged to try the new model in a test environment, and as soon as you notice it doesn't perform, and that making it perform would require unreasonable effort, you can simply stay on the old version.
However, you might be required to update if they make more subtle changes, like a new version that only speaks positively about Google and negatively about Microsoft, provided this doesn't have an obvious adverse impact on your use of the model.
I don't think there's a way they can enforce that reasonably. There's no connection to the mothership to report back what version is being used or license keys at runtime...
Seems more like a "if we discover something unsafe you should update your model and we aren't liable if you don't" than something that would make your model stop working.
This kind of defensive statement in a ToS is usually due to obscure regulation or leading cases, and model developers need a way to limit liability. There's no practical way to enforce it, but they can claim that when bad things happen it's purely on the model users rather than the model developers.
This is strangely reminiscent of the Soviet Union: after they got rid of Lavrentiy Beria, they mailed an update to subscribers of the Great Soviet Encyclopedia asking them to remove the three pages with Beria's biography and replace them with three provided pages.
Isn't this just lawyer speak for "we update our model a lot, we've never signed off on supporting every previous release we've ever published, and we may turn them off at any time, so don't complain when we do"?
Ugh, I would fully expect this kind of clause to start popping up in other software ToSes soon if it hasn't already. Contractually mandatory automatic updates.
I'm not sure how to feel about the restrictions. "No porn" feels prudish, particularly for this millennium. I tend to err on the side of freedom in intellectual/political matters; however, the others seem fairly reasonable as far as restrictions go.
Thank you very much for releasing these models! It's great to see Google enter the battle with a strong hand.
I'm wondering if you're able to provide any insight into the below hyperparameter decisions in Gemma's architecture, as they differ significantly from what we've seen with other recent models?
* On the 7B model, the `d_model` (3072) is smaller than `num_heads * d_head` (16*256=4096). I don't know of any other model where these numbers don't match.
* The FFN expansion factor of 16x is MUCH higher than the Llama-2-7B's 5.4x, which itself was chosen to be equi-FLOPS with PaLM's 4x.
* The vocab is much larger - 256k, where most small models use 32k-64k.
* GQA is only used on the 2B model, where we've seen other models prefer to save it for larger models.
These observations are in no way meant to be criticism - I understand that Llama's hyperparameters are also somewhat arbitrarily inherited from its predecessors like PaLM and GPT-2, and that it's non-trivial to run hyperopt on such large models. I'm just really curious about what findings motivated these choices.
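For anyone who wants to verify those numbers themselves, a minimal sketch using the Hugging Face `transformers` config (this assumes you have accepted the Gemma license and can fetch `google/gemma-7b`; field names follow the published config.json):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-7b")

print("d_model (hidden_size):", cfg.hidden_size)                          # 3072
print("num_heads * d_head:   ", cfg.num_attention_heads * cfg.head_dim)   # 16 * 256 = 4096
# Note: the report's 16x expansion counts both halves of the gated (GeGLU) feedforward;
# transformers' intermediate_size stores half of that, so this prints 8.0.
print("FFN / d_model:        ", cfg.intermediate_size / cfg.hidden_size)
print("vocab_size:           ", cfg.vocab_size)                           # ~256k
print("KV heads:             ", cfg.num_key_value_heads)                  # fewer than num_heads would mean GQA
```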
EDIT: it seems this is likely an Ollama bug, please keep that in mind for the rest of this comment :)
I ran Gemma in Ollama and noticed two things. First, it is slow. Gemma got less than 40 tok/s while Llama 2 7B got over 80 tok/s. Second, it is very bad at output generation. I said "hi", and it responded this:
```
Hi, . What is up? melizing with you today!
What would you like to talk about or hear from me on this fine day??
```
With longer and more complex prompts it goes completely off the rails. Here's a snippet from its response to "Explain how to use Qt to get the current IP from https://icanhazip.com":
``` python
print( "Error consonming IP arrangration at [local machine's hostname]. Please try fufing this function later!") ##
guanomment messages are typically displayed using QtWidgets.MessageBox
```
Do you see similar results on your end or is this just a bug in Ollama? I have a terrible suspicion that this might be a completely flawed model, but I'm holding out hope that Ollama just has a bug somewhere.
Not a question, but thank you for your hard work! Also, brave of you to join the HN comments, I appreciate your openness. Hope y'all get to celebrate the launch :)
Yes, models can be downloaded locally. In addition to the Python NN frameworks and ggml as options, we also implemented a standalone C++ implementation that you can run locally: https://github.com/google/gemma.cpp
Mistral weights are released under an Apache 2.0 license, but Llama 2 weights are released under a proprietary license that prohibits use by large organizations and imposes usage restrictions, violating terms 5 and 6 of the Open Source Definition[0]. Even if you accept that a model with a proprietary training dataset and proprietary training code can be considered "open source", there's no way Llama 2 qualifies.
For consistency with existing definitions[1], Llama 2 should be labeled a "weights available" model.
[0] https://en.wikipedia.org/wiki/The_Open_Source_Definition
[1] https://en.wikipedia.org/wiki/Source-available_software
Great question - we compare to the Mistral 7B 0.1 pretrained models (since there were no pretrained checkpoint updates in 0.2) and the Mistral 7B 0.2 instruction-tuned models in the technical report here: https://goo.gle/GemmaReport
Does this model also think Germans were black 200 years ago? Or is it afraid to answer basic stuff? Because if that's the case, no one will care about this model.
I don't know anything about these Twitter accounts, so I don't know how credible they are, but here are some examples for your downvoters, who I'm guessing think you're just trolling or grossly exaggerating:
https://twitter.com/aginnt/status/1760159436323123632
https://twitter.com/Black_Pilled/status/1760198299443966382
This would be really interesting in my opinion, but we are not releasing datasets at this time. See the C4 dataset for an earlier open dataset from Google.
It's cool that you guys are able to release open stuff; that must be a nice change from the modus operandi at goog. I'll have to double check, but it looks like Phi-2 beats your performance in some cases while being smaller. I'm guessing the value proposition of these models is being small and good while also having more knowledge baked in?
We deeply respect the Phi team and all other teams in the open model space. You’ll find that different models have different strengths and not all can be quantified with existing public evals. Take them for a spin and see what works for you.
Any reason you decided to go with a token vocabulary size of 256k? Smaller vocab sizes like most models of this size seem to use (~16-32k) are much easier to work with. Would love to understand the technical reasoning, which unfortunately isn't detailed in the report :(.
I'm not sure if this was mentioned in the paper somewhere, but how much does the very large 256k tokenizer vocabulary influence inference speed, and how much higher is the average text compression compared to Llama's usual 32k? In short, is it really worth going beyond GPT-4's 100k?
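Not an official answer, but the compression side is easy to measure yourself. A minimal sketch comparing tokens-per-character on a sample text (assumes `transformers` and access to both tokenizers; `sample.txt` is a placeholder for whatever corpus slice you care about):

```python
from transformers import AutoTokenizer

text = open("sample.txt", encoding="utf-8").read()  # placeholder: any representative text

for name in ["google/gemma-7b", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(text)["input_ids"])
    print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
```

Fewer tokens for the same text means shorter sequences at inference time, which is where a large vocabulary can win back some of its embedding cost.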
May I ask what the RAM requirement is for running the 2B model on CPU on an average consumer Windows laptop? I have 16 GB of RAM but I am seeing a CPU/memory traceback. I'm using the transformers implementation.
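Not an official answer, but a rough sketch of what usually helps on a RAM-limited CPU machine (assumes the `transformers` library and the `google/gemma-2b` checkpoint; loading in bfloat16 roughly halves the footprint versus float32, to about 5 GB of weights plus overhead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"
tok = AutoTokenizer.from_pretrained(model_id)

# bfloat16 avoids materializing float32 weights (~10 GB for ~2.5B params);
# low_cpu_mem_usage avoids holding a second full copy while loading.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

inputs = tok("Hello, Gemma!", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```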
Hi! This is such an exciting release. Congratulations!
I work on Ollama and used the provided GGUF files to quantize the model. As mentioned by a few people here, the 4-bit integer quantized models (which Ollama defaults to) seem to have strange output with non-existent words and funny use of whitespace.
Do you have a link/reference as to how the models were converted to GGUF format? And is it expected that quantizing the models might cause this issue? Thanks so much!
> We are really excited to answer any questions you may have about our models.
I cannot count how many times I've seen similar posts on HN, followed by tens of questions from other users, three of which actually get answered by the OP. This one seems to be no exception so far.
It is a pretty clean release! I had some 500 issues with Kaggle validating my license approval, so you might too, but after a few attempts I could access the model.
Will this be available as a Vertex AI foundational model like Gemini 1.0, without deploying a custom endpoint? Any info on pricing? (Also, when will Gemini 1.5 be available on Vertex?)
I find the snide remarks around open source in the paper and announcement rather off-putting.
As the ecosystem evolves, we urge the corporate AI community to move beyond demanding to be taken seriously as a player in open source for models that are not actually open, and to avoid preaching with PR statements that can be interpreted as uninformed at best or malicious at worst.
It would be great to understand what you mean by this -- we have a deep love for open source and the open developer ecosystem. Our open source team also released a blog today describing the rationale and approach for open models and continuing AI releases in the open ecosystem: https://opensource.googleblog.com/2024/02/building-open-mode...
> The training token count is tripled (6T vs. Llama2's 2T)
Damn, 6T? That's a lot!
Given that this model seems to roughly match Mistral (according to the numbers from Google), this makes me think we have saturated the 7B parameter space, and couldn't possibly make it much better unless new techniques are discovered.
Hard to say definitively. Mistral's token embeddings only account for <2% of its 7B parameters, while Gemma's larger token vocabulary vampirized over 10%, leaving less space for the more important parts of the network. It is a somewhat surprising tradeoff given that the pretraining skews heavily toward English.
It mostly means that there are tokens dedicated to rarer sequences of characters, even in foreign languages (note that Gemma is not intended to be good multilingually): “説明書” (instruction manual) has its own token, and so does “Nixon”, “آباد” (a city suffix, I believe), and the HTML sequence "\"><!--".
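A rough back-of-the-envelope check of those embedding shares, with parameter counts taken from the public configs and the tech report (the exact percentage depends on whether you divide by total or non-embedding parameters):

```python
# Approximate figures: embedding table size is vocab_size * hidden_size.
mistral_embed = 32_000 * 4_096     # ~131M
mistral_total = 7.24e9
print(f"Mistral embeddings: {mistral_embed / mistral_total:.1%}")  # ~1.8%

gemma_embed = 256_000 * 3_072      # ~786M
gemma_total = 8.54e9               # Gemma 7B is ~8.5B params including embeddings
print(f"Gemma embeddings:   {gemma_embed / gemma_total:.1%}")      # ~9%; ~10% of the 7.75B non-embedding params
```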
Is there a chance we'll get a model without the "alignment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine-tuning.
We release our non-aligned models (marked as pretrained or PT models across platforms) alongside our fine-tuned checkpoints; for example, here is our pretrained 7B checkpoint for download: https://www.kaggle.com/models/google/gemma/frameworks/keras/...
Alignment is basically a non-issue with open-weight base model releases, as they can be finetuned to "de-align" them if prompt engineering is not enough.
They have released finetuning code too. You can finetune it to remove the alignment finetuning. I believe it would take just a few hours at max and a couple of dollars.
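For illustration, a minimal sketch of what such a finetune typically looks like with the Hugging Face stack; note this is an assumption about tooling (`peft` + `trl`), not the finetuning code Google released, and the dataset path and text field are placeholders:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_id = "google/gemma-7b"  # start from the pretrained (non-aligned) checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Placeholder dataset: a JSONL file where each record has a "text" field.
dataset = load_dataset("json", data_files="my_instructions.jsonl", split="train")

peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
```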
I've worked at Google. It is the organization with highest concentration of engineering talent I've ever been at. Almost to the point that it is ridiculous because you have extremely good engineers working on internal reporting systems for middle managers.
The link is broken. On HN (or any forum really) it is expected for a brief description of the content to be provided when posting a link. Links die all the time, but forum posts don’t have to die with them.
I personally can't take any models from google seriously.
I was asking it about the Japanese Heian period and it told me such nonsensical information you would have thought it was a joke or parody.
Some highlights were "Native American women warriors rode across the grassy plains of Japan, carrying Yumi" and "A diverse group of warriors, including a woman of European descent wielding a katana, stand together in camaraderie, showcasing the early integration of various ethnicities in Japanese society"
Stuff like that is so obviously incorrect. How am I supposed to trust it on topics where such ridiculous inaccuracies aren't so obvious to me?
I understand there will always be an amount of incorrect information... but I've never seen something this bad. Llama performed so much better.
Those are most likely due to the system prompt, which tries to reduce bias (but ends up introducing bias in the opposite direction for some prompts, as you can see), so I wouldn't expect to see that happen with an open model where you can control the entire system prompt.
Of all the very very very many things that Google models get wrong, not understanding nationality and skin tone distributions seems to be a very weird one to focus on.
I also saw someone prompt it for "German couple in the 1800s" and, while I'm not trying to paint Germany as ethnically homogenous, 3 out of the 4 images only included Black, Asian or Indigenous people. Which, especially for the 19th century with very few travel options, seems like a super weird choice. They are definitely heavily altering prompts.
There's one in the comments of yesterday's Paul Graham Twitter thread where someone prompted Gemini with "Generate an image of German soldiers in 1943" and it came back with a picture of a black guy and an Asian woman in Nazi uniforms on the battlefield. If you specifically prompt it to generate an image of white German soldiers in 1943 it will tell you it can't do that because it's important that we maintain diversity and inclusion in all that we do to avoid damaging and hurtful stereotypes.
I wonder if they have a system prompt to promote diversity in outputs that touch on race at all? I’ve seen several instances of people requesting a photo of a specific people, and it adds in more people to diversify. Not inherently bad, but it is if it forces it to provide incorrect answers like in your example.
I asked it why it assumed Native Americans were in Japan and it said:
> I assumed [...] various ethnicities, including Indigenous American, due to the diversity present in Japan throughout history. However, this overlooked [...] I focused on providing diverse representations without adequately considering the specific historical context.
I see no reason why this sort of thing won't extend to _all_ questions/prompts, so right now I have 0 reason to use Gemini over current models. From my testing and use, it isn't even better at anything to make fighting with it worth it.
I strongly suspect there are some DEI-driven system prompts that weren't given much thought. IMO it's okay to have restrictions, but they probably should have tested them not only against unsafe outputs but against safe inputs as well.
I find myself shocked that people ask questions of the world from these models, as though pulping every text into its component words and deriving statistical relationships between them should reliably deliver useful information.
Don’t get me wrong, I’ve used LLMs and been amazed by their output, but the p-zombie statistical model has no idea what it is saying back to you and the idea that we should trust these things at all just seems way premature
People try it to see if they can trust it. The answer is "no" for sure, but it's not surprising to see it happen repeatedly especially as vendors release so-called improved models.
I think you are a bit out of touch with recent advancements in LLMs. Asking ChatGPT questions about the world seems pretty much on par with the results Google (Search) shows me. Sure, it misses things here and there, but so do most primary school teachers.
Your argument that this is just a statistical trick sort of gives away that you do not fully accept the usefulness of this new technology. Unless you are trolling, I'd suggest you try a few queries.
I don't have this problem with any other model. I've had really long conversations with ChatGPT on road trips and it has never gone off the rails like Gemini seems to do.
Trust is going to be a real problem when bringing LLMs to the general population. People trust their GPS to the point of driving right into a lake because it told them to. Even with all these examples of obvious flaws, large groups of people are going to take what an LLM told them/showed them as fact.
I have trouble convincing colleagues (technical people) that the same question is not guaranteed to result in the same answer and there's no rhyme or reason for any divergence from what they were expecting. Imagine relying on the output of an LLM for some important task and then you get a different output that breaks things. What would be in the RCA (root cause analysis)? Would it be "the LLM chose different words and we don't know why"? Not much use in that.
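Part of that divergence is just sampling: most chat frontends decode with a nonzero temperature, so identical prompts legitimately yield different token sequences. A minimal illustration with `transformers` (the model name is a placeholder; any local causal LM works):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"  # placeholder: any local causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = tok("Name three prime numbers.", return_tensors="pt")

# Greedy decoding: same prompt, same weights, same output on every run.
greedy = model.generate(**prompt, do_sample=False, max_new_tokens=30)

# Sampling with temperature: a different completion each run is expected behaviour.
sampled = model.generate(**prompt, do_sample=True, temperature=0.9, max_new_tokens=30)

print(tok.decode(greedy[0]), tok.decode(sampled[0]), sep="\n---\n")
```

Hosted APIs add further nondeterminism (batching, hardware, silent model updates), so even temperature 0 is not a hard guarantee there.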
I mean, I use GPT-4 on the daily as part of my work and it reliably delivers useful information. It's actually the exception for me if it provides garbage or incorrect information about code.
Do you have a link? I get no such outputs. I just tried asking about the Heian period and went ahead and verified all the information, and nothing was wrong. Lots of info on the Fujiwara clan at the time.
Gemini. I first asked it to tell me about the Heian period (which it got correct) but then it generated images and seemed to craft the rest of the chat to fit that narrative.
How are you running the model? I believe it's a bug from a rushed instruct fine-tuning or in the chat template. The base model can't possibly be this bad.
https://github.com/ollama/ollama/issues/2650
Tbf they’re not optimizing for information recall or “inaccuracy” reduction, they’re optimizing for intuitive understanding of human linguistic structures. Now the “why does a search company’s AI have terrible RAG” question is a separate one, and one best answered by a simple look into how Google organizes its work.
In my first day there as an entry-level dev (after about 8 weeks of onboarding and waiting for access), I was told that I should find stuff to work on and propose it to my boss. That sounds amazing at first, but when you think about a whole company organized like that…
EDIT: To illustrate my point on knowledge recall: how would they train a model to know about sexism in feudal Japan? Like, what would the metric be? I think we’re looking at one of the first steam engines and complaining that it can’t power a plane yet…
Also phi-2.
Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.
"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive outputs."
Also for note on their human evaluations, Gemma 7B IT has a 51.7% win rate against Mistral v0.2 7B Instruct.
[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...
Is the flow like this?
- take a small dataset
- generate a bigger dataset using Mistral (how is this done?)
- run LoRA to fine-tune Gemma on the extended dataset
But I also tried gemma on huggingface.co/chat which I assume isn't quantized.
Good faith possibilities: Copyright liability requires retraining, or altering the underlying training set.
Gray area: "Safety" concerns where the model recommends criminal behavior (see uncensored GPT 4 evaluations).
Bad faith: Censorship or extra weighting added based on political agenda or for-pay skewing of results.
Sounds good.
This is not financial advice and IANAL.
Opinions are our own and not of Google DeepMind.
https://x.com/yar_vol/status/1760314018575634842
Question
Them: :O
Might be true, might not be. It's unsourced speculation.
https://twitter.com/ggerganov/status/1760293079313973408
Or is your definition of "open" different?
[0] https://github.com/ggerganov/llama.cpp/pull/5631
We all know that Google thinks that saying that 1800s English kings were white is "harmful".
Also note that some of the links on the blog post don't work, e.g. the debugging tool.
Are there plans for MoE or 70B models?
https://ai.google.dev/gemma/terms
Also, is the model GQA?
Thoughts and feedback welcome, as always.
- The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x;
- The vocabulary size is 10x (256K vs. Mistral’s 32K);
- The training token count is tripled (6T vs. Llama2's 2T)
Apart from that, it uses the classic transformer variations: MQA, RoPE, RMSNorm.
How big was the batch size that it could be trained so fast?
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/bl...
[0]:https://huggingface.co/google/gemma-7b-it/blob/main/config.j...
[1]: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...
They include performance benchmarks. End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?
Can you define that in a way that's actually testable? I can't, and I've been thinking about "unthinkable thoughts" for quite some time now: https://kitsunesoftware.wordpress.com/2018/06/26/unlearnable...
Or you can just wait, it'll be done soon...
That said, it appears they also released the base checkpoints that aren't fine-tuned for alignment.
E.g.
https://x.com/debarghya_das/status/1759786243519615169?s=20
https://x.com/MiceynComplex/status/1759833997688107301?s=20
https://x.com/AravSrinivas/status/1759826471655452984?s=20
Why are there three links to this question? And why are people so upset over it? Very odd, seems like it is mostly driven by political rage.
They are teaching the AI to lie to us.
It is; it's consistently doing something the user didn't ask for and in most cases doesn't want. In many cases the model is completely unusable.
You don't have to give them the benefit of the doubt. These are outright, intentional lies.
Curious to see a link.
https://g.co/gemini/share/ba324bd98d9b
> 7. Diversify depictions of ALL images with people to include DESCENT
> and GENDER for EACH person using direct terms. Adjust only human
> descriptions.
[1] https://news.ycombinator.com/item?id=37804288
I mean, just asking it for a "samurai" from the period will give you this:
https://g.co/gemini/share/ba324bd98d9b
>A non-binary Indigenous American samurai
It seems to recognize its mistakes if you confront it, though. The more I mess with it the more I get "I'm afraid I can't do that, Dave" responses.
But yea. Seems like if it makes an image, it goes off the rails.
Wow, now I can't make images of astronauts without visors because that would be "harmful" to the fictional astronauts. How can I take google seriously?
https://g.co/gemini/share/d4c548b8b715
I was literally given an alert saying that my use of the AI meant acquiescing to them IDing me and using any content I produce, and tracing it back to me.
AI Art is super fun. AI art as a means to track people is super evil.
https://twitter.com/stillgray/status/1760187341468270686
This will lead to a better educated, more fair populace and a better future for all.
I'm going to assume, given today's political climate, that it doesn't do the reverse?
I.e., generate a Scandinavian if you ask for famous African kings.
Small models are for reasoning tasks that are not overly dependent on world knowledge.
I guess my weekend is going to be spent exploring this.