The weights should be the same across formats, but it's easy for differences to arise due to quantization and/or subtle implementation differences. Minor implementation differences have been a pain point in the ML ecosystem for a while (w/ IRs, onnx, python vs. runtime, etc.), but hopefully the differences aren't too significant (if they are, it's a bug in one of the implementations).
There were quantization fixes like https://twitter.com/ggerganov/status/1760418864418934922 and other patches happening, but it may take a few days for patches to work their way through the ecosystem.
I'm using the 32-bit GGUF model from the Google repo, not a different quantized model, so I'd have one less source of error. It's hard to tell with LLMs whether it's a bug. It just gives slightly stranger answers sometimes, but it's not complete gibberish, incoherent sentences, or extra punctuation like with some other LLM bugs I've seen.
Still, I'll wait a few days to build llama.cpp again to see if there are any changes.
To get a few common questions out of the way:
- This is separate / independent of llama.cpp / ggml. I'm a big fan of that project and it was an inspiration (we say as much in the README). I've been a big advocate of gguf + llama.cpp support for gemma and am happy for people to use that.
- How is it different than inference runtime X? gemma.cpp is a direct implementation of gemma; in its current form it's aimed at experimentation + research and portability + easy modifiability, rather than a general purpose deployment framework.
- This initial implementation is CPU SIMD centric. We're exploring options for portable GPU support, but the cool thing is it will build and run in a lot of environments you might not expect an LLM to run in, so long as you have the memory to load the model.
- I'll let other colleagues answer questions about the Gemma model itself, this is a C++ implementation of the model, but relatively independent of the model training process.
- Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.
- I wrote a twitter thread on this project here: https://twitter.com/austinvhuang/status/1760375890448429459
- Somewhere in the README, consider adding the need for a `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp. Maybe around step 3.
- The README should explicitly say somewhere that there's no GPU support (at the moment).
- "Failed to read cache gating_ein_0 (error 294)" is pretty obscure. I think even "(error at line number 294)" would be a big improvement when it fails to FindKey.
- There's something odd about the 2b vs 7b model. The 2b will claim it's trained by Google but the 7b won't. Were these trained on the same data?
- Are the .sbs weights the same weights as the GGUF? I'm getting different answers compared to llama.cpp. Do you know of a good way to compare the two? Any way to make both deterministic? Or even dump probability distributions on the first (or any) token to compare?
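One way to do that comparison, assuming you can get each runtime to dump the raw logits for the first generated token (how to extract them from llama.cpp or gemma.cpp is outside this sketch): turn each logit vector into a probability distribution and measure the KL divergence between them. A minimal, hypothetical sketch:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how much q diverges from p; 0 means identical."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy logit vectors standing in for the two runtimes' first-token outputs.
logits_a = [2.0, 1.0, 0.1]
logits_b = [2.1, 0.9, 0.1]

p = softmax(logits_a)
q = softmax(logits_b)
print(kl_divergence(p, q))  # near 0 => the implementations agree closely
```

With greedy (temperature 0) sampling both runtimes should also pick the same argmax token at every step, which is a cheaper first check than comparing full distributions.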
Basically it starts computing a response every time a word comes out of the speech recognizer, and if it is able to finish its response before it hears another word then it starts speaking. If more words come in then it stops speaking immediately; in other words, you can interrupt it. It feels so much more natural in conversation than ChatGPT's voice mode due to the low latency and continuous listening with the ability to interrupt.
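The loop described above (recompute on every recognized word, speak only if the response finishes before the next word arrives, stop immediately on interruption) can be sketched as a tiny state machine. Everything here is hypothetical glue; the callback stands in for the real Whisper/LLM/TTS plumbing:

```python
from dataclasses import dataclass, field

@dataclass
class VoiceLoop:
    """Toy model of the interruptible conversation loop."""
    speaking: bool = False
    pending_words: list = field(default_factory=list)

    def on_word(self, word):
        # A new word from the recognizer always interrupts playback
        # and feeds into the next response computation.
        if self.speaking:
            self.speaking = False  # stop TTS immediately (barge-in)
        self.pending_words.append(word)

    def on_silence(self, generate_response):
        # Generation finished before another word arrived: start speaking.
        reply = generate_response(self.pending_words)
        self.speaking = True
        return reply

loop = VoiceLoop()
loop.on_word("hello")
loop.on_word("there")
reply = loop.on_silence(lambda words: "you said: " + " ".join(words))
print(reply)          # you said: hello there
loop.on_word("wait")  # barge-in: speaking stops
print(loop.speaking)  # False
```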
There are a lot of things that need improvement. Most important is probably that the speech recognition system (Whisper) wasn't designed for real time and is not that reliable or efficient in a real time mode. I think some more tweaking could improve reliability considerably. But also very important is that it doesn't know when not to respond. It will always jump in if you stop speaking for a second, and it will always try to get the last word. A first pass at fixing that would be to fine tune a language model to predict whose turn it is to speak.
There are also a lot of things that this architecture will never be able to do. It will never be able to correct your pronunciation (e.g. for language learning), it will never be able to identify your emotions based on vocal cues or express proper emotions in its voice, it will never be able to hear the tone of a singing voice or produce singing itself. The future is in eliminating the boundaries between speech-to-text and LLM and text-to-speech, with one unified model trained end-to-end. Such a system would be able to do everything I mentioned and more, if trained on enough data. And further integrating vision will help with conversation too, allowing it to see the emotions on your face and take conversational cues from your gaze direction and hand gestures, in addition to all the other obvious things you can do with vision such as chat about something the camera can see or something displayed on your screen.
However, can you point me to the lectures where training happens (and where architecture choices, hyperparameter selection, and debugging happen)? I'm less familiar with SD, but at a quick glance it seems like we're using a pretrained model and implementing bits that will eventually be useful for training, but not training a new model, at least in the beginning of the deep dive notebook and the first few lessons (starting at part 2, lesson 9).
If you're saying FizzBuzz doesn't work, presumably you mean that encoding n directly doesn't work. Neither does encoding n from 0 to 1 or between -1 and 1 (and don't forget: obviously don't use relu with -1 to 1). It doesn't work.
Neural networks can do a LOT of things, but they cannot deal with numbers. And they certainly cannot deal with natural or real numbers. BUT they can deal with certain encodings.
Instead of using the number directly, give one input to the neural network per bit of the number. That will work. Just pass in the last 10 bits of the number.
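For example, the "last 10 bits" encoding is just this (a sketch; 10 inputs cover n up to 1023):

```python
def bits(n, width=10):
    """Encode n as its low `width` bits, LSB first, one network input per bit."""
    return [(n >> i) & 1 for i in range(width)]

print(bits(6))  # [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```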
Or cheat and use transformers. Pass in the last 5 generations and have it construct the next FizzBuzz line. That will work. Because it's possible.
To make the number-based neural network for FizzBuzz "perfect", think about it: the network needs to be able to divide by 3 and 5. It can't, and you can't fix that directly. You must make it possible for the neural network to learn the algorithm for dividing by 3 and 5 ... 2, 3 and 5 are relatively prime (and actual primes), so binary digits carry no information about divisibility by 3 or 5. So "cheat" and pass in numbers in base 15 (by one-hot encoding the number mod 15, for example).
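Concretely, the mod-15 one-hot encoding and the target labels it makes trivially learnable look like this (a sketch):

```python
def one_hot_mod15(n):
    """One-hot encode n mod 15: 15 inputs, exactly one of them set to 1."""
    v = [0] * 15
    v[n % 15] = 1
    return v

def fizzbuzz_label(n):
    """Target class: 0=number, 1=Fizz, 2=Buzz, 3=FizzBuzz."""
    if n % 15 == 0:
        return 3
    if n % 3 == 0:
        return 1
    if n % 5 == 0:
        return 2
    return 0

# Every n with the same residue mod 15 shares a label, so the mapping from
# this encoding to the label is a fixed 15-entry lookup a network can learn.
print(one_hot_mod15(9), fizzbuzz_label(9))  # Fizz
```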
PM me if you'd like to debug whatever network you have together over zoom or Google meets or whatever.
https://en.wikipedia.org/wiki/One-hot
This may be catastrophically wrong. I only have a master's in machine learning (a European master's degree, meaning I've written several theses on it; didn't pass the first time, had to work full time to be able to study), and I was writing captcha crackers using ConvNets in 2002. But I've never been able to convince anyone to hire me to do anything machine learning related.
You mention a bag of tricks, and that's indeed one issue, but it's worse than that, because it includes knowing which "silent problems" need a trick applied to them in the first place!
Indeed, despite using vectors everywhere, NNs are bad with numerical input encoded as itself! It's almost like the only kind of variables you can have are fixed-size enums, which you then encode into vectors that are as far apart as possible; unit vectors ("one-hot vectors") do this. But that's not quite it either: sometimes you can still have some meaningful metric on the input that's preserved in the encoding (example: word embeddings). And so it's again unclear what you can give it and what you can't.
In this toy example, I have an idea of what the shape of the solution is. But generally I do not and would not know to use a base 15 encoding or to send it the last 5 (or 15) outputs as inputs. I know you already sort of addressed this point in your last few paragraphs.
I'm still trying out toy problems at the moment, so it might be a "waste" of your time to troubleshoot these, but I'm happy to take you up on the offer. HN doesn't have PMs though.
Do you remember when you first learned about the things you're using in your reply here? Was it in a course, or just from asking someone who'd worked on NNs for longer? I learned by googling and finding comment threads like these! But they're not easy to collect or find together.
https://karpathy.github.io/2019/04/25/recipe/
but the general idea of "get something that can overfit first" is probably pretty good.
In my experience, getting the data right is probably the most underappreciated thing. Karpathy has data as step one, but in my experience, data representation and sampling strategy also do quite the miracle.
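A concrete version of the "get something that can overfit first" sanity check: take a handful of samples and verify training can drive the loss on them to (near) zero. Sketched here with a one-parameter model and plain gradient descent; with a real framework and a tiny data subset, the shape of the check is the same:

```python
# Sanity check: can the model memorize a tiny dataset?
# Model: y_hat = w * x, loss = mean squared error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # exactly fit by w = 2

w, lr = 0.0, 0.05
for step in range(200):
    # Gradient of MSE with respect to w, averaged over the data.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
print(round(w, 3), loss < 1e-6)  # if loss won't go to ~0 here, debug before scaling up
```

If even this memorization step fails, the problem is in the model, the encoding, or the training loop, not in a lack of data.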
In Part II of our book we do an end-to-end project, including e.g. a moment where nothing works until we crop around "regions of interest" to balance the per-pixel classes in the training data for the UNet. This is something I've pasted into the PyTorch forums every now and then, too.
I think I'm still at a step before the overfit. It doesn't converge to a solution on its training data (fit or overfit). And all my data is artificially generated so no cleaning is needed (though choosing a representation still matters). I don't know if that's what you mean by getting the data right or something else. Example problems that "don't work": fizzbuzz, reverse all characters in a sentence.
When I try to use transformers or any AI thing on a toy problem I come up with, it never works. Even FizzBuzz, which I thought was easy, doesn't work (because division or modulo is apparently hard for NNs to represent). And there's this black box of training that's hard to debug into. Yes, for the available resources, if you pick the exact same problem, the exact same NN architecture, and the exact same hyperparameters, it all works out. But surely they didn't get that on the first try. So what's the tweaking process?
Somehow this point isn't often talked about in courses, and consequently the ones who've passed this hurdle don't get their experience transferred. I'd follow an entire course on this if it were available. An HN commenter linked me to this
https://karpathy.github.io/2019/04/25/recipe/
which is exactly on point. But it'd be great if it were one or more tutorials with a specific example, wrapped in code and peppered with many failures.
Sure, GrapheneOS is often suggested, but Ubuntu Touch is a really interesting alternative, with its own store and ecosystem.
The community is amazing and welcoming. If there are Android apps which you can't do without, they can be emulated and used anyway. Imagine switching to Linux and then using Wine for the apps you really still need.
Yes, it's not perfect but Linux isn't either. If you think you're sufficiently tech savvy and want to make a change, give Ubuntu Touch a try. Find a cheap second hand supported device and play around, make some fun apps. (devices currently supported: https://devices.ubuntu-touch.io/ )
To me it's like being back when Windows and Mac were the only viable home computer OSes, and people were getting their feet wet with Linux and all its flavours. Now it's the same, but for mobile.
It makes some things that should be easy on Linux harder. E.g., there's no Firefox + mobile tweaks like on other Linux mobile OSes, in part because it wants you to use Morph.
But other Linux mobile OSes dropped support for Halium/libhybris, and even the very few that still have it don't seem to match Ubuntu Touch's level of hardware support.