A lot about this project was surprising. We knew it was going to be good, but didn't expect it to be this good -- especially surprising was the finetuned performance boost, and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).
It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.
Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).
First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.
I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).
My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.
While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.
These findings have been confirmed by others, including by the BLOOM open source LM project.
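For anyone curious what ALiBi actually does, here is a minimal sketch of the idea (simplified, not this model's actual implementation): instead of adding positional embeddings, each attention head subtracts a penalty from its scores that grows linearly with query-key distance.
```
import numpy as np

def alibi_scores(q, k, slope=0.5):
    """Causal attention scores for one head, with an ALiBi linear distance penalty."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)        # standard scaled dot-product attention
    i, j = np.indices((n, n))            # i = query position, j = key position
    scores = scores - slope * (i - j)    # penalty grows linearly with distance to the key
    scores[j > i] = -np.inf              # causal mask: a token cannot attend to the future
    return scores                        # a row-wise softmax would follow
```
The per-head slope is the only positional signal, which is what allows extrapolation to sequence lengths longer than those seen in training.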
Impressive model, thank you for releasing it under a business-friendly license!
Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.
What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?
Broadly, finetuning is any training done after pretraining. Most of the time it is used to fit a narrower task. In our case, it was the same training objective as the pretraining, but on data meant to be more representative of what Replit users like to code. However, we were surprised by how well it boosted overall performance. Best guess: it's a) novel data and b) the model could take even more training!!
You can take a network and its weights that someone else trained, and use that pretrained network to train on your own data, which is likely to be a better starting point than random weights.
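As a rough sketch of what that looks like in practice -- assuming the Hugging Face checkpoint linked elsewhere in the thread and a standard transformers causal-LM interface (labels in, loss out); this is an assumption for illustration, not Replit's actual training setup:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: start from the pretrained weights instead of a random initialization.
tokenizer = AutoTokenizer.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

# Placeholder for whatever code you want to specialize on.
your_code_samples = ["def add(a, b):\n    return a + b\n"]
for text in your_code_samples:
    batch = tokenizer(text, return_tensors="pt")
    # Same next-token-prediction objective as pretraining, just on your own data.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```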
The base model checkpoint is licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.
Can't find it now, but I'm pretty sure BigCode said somewhere they explicitly looked for it and removed it. Also, the subjective measure does match up with the benchmark: our finetuned model performed +50% on HumanEval, and when using it, it felt at least that much improved.
> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.
So to answer your question, yes, the evaluation dataset is spoiled. You can find such unique and never before seen docstrings like
> For a given list of input numbers calculate the Mean Absolute Deviation around the mean of this dataset. Mean Absolute Deviation is the absolute difference between each element and a centerpoint (mean in this case)[1]
And here's a repo I found that is 8 years old[2]. But how about a more recent one that is even closer?[3] There are plenty more examples[4] (does anyone know how to actually limit the date to prior to 2021? `pushed:<2021` doesn't work, nor does using the `created` keyword. Date searching doesn't seem to work well).
In essence, we can still use this evaluation method to determine how good our model is at doing fuzzy searching. Which, mind you, is still a useful thing. But I would be careful in concluding that this means the model is good at generalizing arbitrary descriptions of code or novel pieces of code. That said, one may be able to argue that not many lines of code are actually that novel. Still, we need to be careful about our conclusions and understand the limitations of our metrics (something I am currently deeply troubled by)
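A quick way to sanity-check this kind of contamination yourself is to fuzzy-match benchmark docstrings against candidate training files. A rough sketch (the second docstring is a made-up stand-in for a GitHub hit, and this is not what BigCode or Replit actually ran):
```
from difflib import SequenceMatcher

def _norm(s: str) -> str:
    # Normalize whitespace and case so formatting differences don't hide a match.
    return " ".join(s.lower().split())

def similarity(a: str, b: str) -> float:
    """Rough fuzzy-match score between a benchmark docstring and a candidate snippet."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()

humaneval_doc = ("For a given list of input numbers calculate the Mean Absolute Deviation "
                 "around the mean of this dataset.")
candidate_doc = "Calculate the mean absolute deviation around the mean of the given numbers."

print(similarity(humaneval_doc, candidate_doc))  # a high score suggests possible contamination
```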
1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.
2- I like how portable it is: a single small model covering a lot of languages. Single-language code models are the approach that models like Salesforce/CodeGen took, but I believe we beat (or get very close to) their mono-language models on benchmarks.
The model is way too small; comparing it to Codex feels disingenuous. Sure, it's 77% smaller, but it's also 77% worse. Although, it's a cool project nonetheless.
For instance, even this simple snippet generates wrong inline completions:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) =>
```
Replit-code-v1:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) => {
  return array.filter((item) => item > 10);
};
```
Gets it wrong, returns odd numbers.
Codeium:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) => {
  return array.filter((num) => num > 10 && num % 2 === 0);
};
```
ChatGPT (GPT-3.5 Turbo) - Code-only, without the rest of the completion since it's instruction-tuned:
```
const arrayFilter = (array) => {
  return array.filter(num => num % 2 === 0 && num > 10);
}
```
Not comparable at all. For reference, if anyone wants to test: I ran this through the HuggingFace space using the default parameters, ChatGPT through chat.openai.com, and Codeium through the VSCodium extension on an empty JavaScript file.
That's really interesting, indeed I can reproduce this by changing the comment. I also managed to get correct output for this sample by renaming the function.
It seems like every week someone comes out with some version of "we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"
And then you dig in, and it's always far behind in some important way.
Not hating here, I love the pace of iteration, just not the hyperbole.
Yeah I tried the demo, it wrote some wrong code with comments in Chinese. I think I'll pass.
It's a pretty well accepted fact now that bigger LLM = moar better without exceptions. I'm not sure why there's a race to the bottom of who'll make the most useless model that can run everywhere.
That's not true; the amount of training is a major factor. See the Chinchilla paper: https://arxiv.org/abs/2203.15556. tl;dr - a "fully" trained small model can outperform an "undertrained" larger model. If you have a fixed amount of compute (budget), then you need to optimize for the largest model that you can fully train, and not simply up the parameter count.
EDIT: Also you can't necessarily compare the parameter count across model architectures*
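As a rough back-of-the-envelope of what "fully trained" means under Chinchilla, using the paper's roughly 20-tokens-per-parameter rule of thumb and the common C ≈ 6·N·D approximation (both constants are approximations, not exact values from the paper's tables):
```
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a fixed compute budget C between model size N and training tokens D.

    Uses C ~= 6 * N * D and D ~= tokens_per_param * N,
    so N ~= sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e21)  # example: a ~1e21 FLOP budget
print(f"~{n / 1e9:.1f}B params trained on ~{d / 1e9:.0f}B tokens")
```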
This thing seems to outperform the finetuned 30B llama models I've seen.
hi HN! back again with an exclusive deep dive with Replit’s head of AI. I attended their developer day last week (https://twitter.com/swyx/status/1650989632413401089) just expecting a regular fundraise announcement and was totally shocked when they announced their own LLM and also said they would open source it. so I immediately asked them for a podcast interview and this is the result.
my favorite learning is how they are pushing the state of the art - openai’s HumanEval is the industry standard benchmark for code LLMs, but Reza kindly went above and beyond to show how they use “AmjadEval” - using coder intuition to capture human preference on what output is more helpful to coders (see screenshots https://twitter.com/swyx/status/1653791019421569024?s=20). please AMA!
It's not crucial that it beat ChatGPT this year; that's a pretty unattainable goal for a group like Replit. From the user's POV, none of the current copilots compare favorably against ChatGPT, not even Microsoft's OpenAI-powered GitHub Copilot.
What's important is that they're preparing for the future by building all the tooling/UI/UX around coding copilots. This way, when costs and feasibility of building ChatGPT-quality LLM's drop and multiple open-source models are available, Replit has the ability to immediately drop them into their production environment. They'll also have the skills and systems to finetune any new models and wring extra performance out of them.
This is more important to users than it seems at first because current UX of things like GitHub Copilot don't allow me to use their AI against my codebase the way that I want to (the way I use ChatGPT). Right now GitHub Copilot is a glorified auto-complete, but I want it to do widespread scaffolding, refactoring, and analysis across my whole codebase. Microsoft has access to LLM's that can do this through their control of OpenAI -- but Microsoft lacks the tooling/UI/UX to bring the power of ChatGPT to me as a user of VSCode/IntelliJ/PyCharm/Visual Studio.
So if Replit can find more innovative, boundary-pushing ways of integrating LLM's, they won't necessarily need the highest quality LLM's to produce a superior user experience. It's a strong signal that Replit is well-positioned for the future, when ChatGPT-like models are democratized.
Hopefully JetBrains is paying attention. They definitely have time to wait a bit more (1-2 years?), but not a lot of time. JetBrains shouldn't solely rely on Github Copilot plug-in to provide their users with LLM's, because it's not clear that the user experience of that plug-in will stay competitive with the user experience that GitHub Copilot will offer directly in VSCode. The IntelliJ/PyCharm plugin may remain "just a fancy auto-complete" while VSCode gets more interactive workflows.
Future IDE's with LLM integration require novel, smart, clever UX typically invented only by very creative people.
It's also worth noting that Replit is not just trying to be an IDE -- they're also building a marketplace to buy/sell coding work, and establishing a small foothold as a niche cloud computing provider.
it doesn't. replit-code-v1-3b is a code LLM, ChatGPT is an app on top of LLMs. it compares to OpenAI Codex, a small version of which is behind GitHub Copilot.
Hard to compare them, actually. The thing about ChatGPT is the chat part: it was trained to interact and respond with human conversation. This is more like Copilot, with code completion based on actual code.
Replit: I tried to get it to write HMM forward-algorithm code:
```
# Write a function that performs the forward algorithm of an HMM. Ideally it uses numpy vectorization,
# But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.
#
def HMM_forward() :
# Write your code here
# Hint:
# - Remember that HMMs are stochastic, so you will need to multiply transition matrices by observation probabilities
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by observation probabilities
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
# - Remember that HMMs are stochastic, so
```
It gets stuck in an infinite loop.
ChatGPT gets it right, first shot:
```
import numpy as np
def forward_algorithm(A, B, pi, obs):
"""
Performs the forward algorithm for an HMM.
Parameters:
A: numpy.ndarray, shape (N, N)
Transition matrix of the HMM, where N is the number of states.
B: numpy.ndarray, shape (N, M)
Emission matrix of the HMM, where M is the number of possible observations.
pi: numpy.ndarray, shape (N,)
Initial probability distribution over states.
obs: numpy.ndarray, shape (T,)
Sequence of T observations.
Returns:
alpha: numpy.ndarray, shape (T, N)
Forward probabilities for each state at each time step.
"""
T = obs.shape[0]
N = A.shape[0]
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]
return alpha
```
OpenAI managed to do the important but extremely hard thing: they moved out of the DL-benchmark frame and made something that is general-purpose useful. Great effort and congrats to the Replit team though; hopefully they can keep iterating on this and reach ChatGPT capabilities someday.
The model is not RLHF'd or instruction-tuned. It's an inline autocomplete model, so it will get confused if you talk to it like you're talking to a person (although it is possible to finetune it this way). To get better full-function completion, try giving it the function definition and a descriptive docstring as a prompt.
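For example, a prompt shaped like this (a hypothetical illustration of the definition-plus-docstring pattern, not an official recommendation) gives the model one clear function to complete:
```
from typing import List

def filter_even_above_threshold(numbers: List[int], threshold: int = 10) -> List[int]:
    """Return only the even numbers in `numbers` that are strictly greater than `threshold`."""
    # (the model is expected to continue from here with the function body)
```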
Really? My experience with GPT is that the more description I add, the better the results. I presume this is because it has a longer prompt to attend to. I think the whole idea of focusing on keywords/concise sentences is a very “search engine” paradigm; language models do better the more you describe your question.
LLMs generally, but small models especially, will keep going and generate seemingly unrelated things. On the frontend, tools like Copilot and Ghostwriter do a lot of things like using stop sequences or simply not showing completions outside a single block.
As for your prompt, it's following your prompt a little too closely and generating just the function. You can, however, condition it so that this is the start of the program and it will add the import, e.g.
```
# python function that returns a random integer between min and max
import
```
That's because it's not following instructions like ChatGPT; it's just trying to guess what could plausibly come after what you put, like Copilot or the old GPT-3 models.
I've had the issue of generating random code after the completion with other models as well; it's due to how the models are trained. You need to stop generating when you encounter token(s) that indicate you're done - see https://huggingface.co/replit/replit-code-v1-3b#post-process...
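A minimal sketch of that post-processing step (the stop strings here are illustrative; the model card linked above describes the exact tokens to use):
```
def truncate_at_stop(generated: str, stop_sequences=("\n\n\n", "<|endoftext|>")) -> str:
    """Cut a raw completion at the first stop sequence so unrelated trailing code is dropped."""
    cut = len(generated)
    for stop in stop_sequences:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut]

raw = "return a + b\n\n\n# unrelated function the model kept generating..."
print(truncate_at_stop(raw))  # -> "return a + b"
```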
This is amazing work, and bravo to the people working on RedPajama.
This is fantastic for the world, this means LLMs will not be controlled by a couple of companies with the associated rents.
Yes, private LLMs will likely be a couple of years ahead of 'free' alternatives, but that's OK; we want to incentivize for-profit research so long as the services become low-priced in time (and in this case, in short order).
My first reaction was, "why is replit building LLMs," but I guess it fits their needs to have one optimized for their use. But I wonder, is this the beginning of another wave of "every company is an AI company?" Are we going to see a spike in tech hiring around AI/LLM, money starting to flow again, etc? And how many years until it all blows up and the layoffs start?
Finetuning LLMs (and any model) is going to be a common practice. Each company is its own domain, with domain knowledge and data it can use to specialize open-sourced models, or to distill/teach its own proprietary model (home-grown or modified from someone else's).
No Clojure. No Julia. No Haskell. No Racket. No Scheme. No Common Lisp. No OCaml. And, as much as I despise Microsoft, No C#. No F#. No Swift. No Objective-C. No Perl. No Datalog. A glaringly lacking choice of languages.
Despite the lack of examples, it still completes trivial Clojure like "(defn connect [" and other Lisp syntax like "(define (hello", which is promising for further refinement training on Lisp languages.
And probably, yes. While it contains 358 programming languages, obviously there's a long tail after the 20 most-represented languages. Some people might not realize that many of the most-represented "languages" are actually things like JSON, XML, HTML, CSV, text, Markdown, YAML, and SVG.
Also note that it won't be able to parse natural language nearly as well without additionally being trained on something like the LAION dataset, so this version will be more of an autocomplete like Copilot rather than something which can manifest high level business logic from whole cloth like ChatGPT.
It is a bit hard to believe that the system is decent at producing code which captures complex ideas and higher-level structure when the tokens/param value is >30 (it's ~200 here?). The 'good' models (meaning those with lots of 'knowledge' or 'memorization' of the dataset) typically tend to be around 2 tokens/param, and models with decent language generation but less knowledge/memorization are around 30 tokens/param.
Perhaps the domain allows for this, but given that the linguistic interface on the input is still needed... it's hard to believe.
Tokens/param shouldn't matter more than the total training FLOPs, I believe. Clearly, if we train at your claimed 'ideal' 2 tokens/param on a very small dataset (not many tokens in the first place), the model wouldn't have enough data to properly learn the relevant languages. Once there is enough data, it becomes a question of model capacity (does it have enough degrees of freedom to support the computational structures needed?).
I believe the overparametrization largely helps with generalization and reducing overfitting; at 2 tokens/param there are far more degrees of freedom than structures that can be learned, from what I can tell (the extra capacity just provides good breathing room for internal structures). But if your model has enough capacity, and you can find a good enough training method (and you have enough data to learn the task), then you should be able to succeed at arbitrarily low tokens/param, which is good to keep in mind when making efficient models.
this kind of critical thinking is exactly what replit is going to need for their stated goal of doing whole-app generation. right now they only test it on AmjadEval. you… might wanna consider joining them to work on it?
I'm not sure noticing tokens/param ratios or simplicial modeling properties requires much critical thought - perhaps it's just a standard first thought for anyone given an LM now.
I've worked tangentially to NLP for about 7 years in academia, but most of my work has been focused on a lesser-known field of mathematics applied either to outputs or to the NNs themselves, as well as bioinformatics. As such, my expertise may not be as refined as the real players in the field such as Glaese, Finn, Veličković, etc., but I typically try to keep up with the actual key advancements in the field (usually the stuff few people notice). This area takes far too much compute for many people to actually become experts in it, so much of my interest has been less in large LMs and more in algorithms.
But I suppose I agree that it is frustrating to see how little knowledge many of the hype-filled crowds possess that are piling into this area. (Not calling anyone specifically out in this thread, just in general across the internet)
i believe GP is referencing the Kaplan and Chinchilla scaling laws. we reference those in the podcast but i’m not sure if some deeper insight is being hinted at here where different scaling laws apply for different domains/purposes
- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...
- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b
- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo
- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081
Regarding the sparse scaling-transformer suggestion: here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.
Hope you get to look into this!
Like why did we even get excited? This? Great work.
Is that a guess, or is there a source? I'm curious to read more.
References for the HumanEval contamination examples above:
[0] https://arxiv.org/abs/2107.03374
[1] https://github.com/openai/code-align-evals-data/blob/97446d9...
[2] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...
[3] https://github.com/danielwatson6/hate-speech-project/blob/64...
[4] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...
1 - Why did you choose Markdown? It seems an odd choice for training a model like this.
2 - Have you tried to train only one single PL and then benchmark it against this more general version?
Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.
More details from them here: https://blog.replit.com/llm-training
Reference: How Replit used legal threats to kill my open-source project https://intuitiveexplanations.com/tech/replit/
Regarding the filtering example above: if the instruction says "return person objects that are at least 20 years old", it might be more reasonable to generate:
```
array.filter(item => item.age >= 20)
```
as opposed to
```
array.filter(item => (item instanceof Person) && (item.age >= 20))
```
I have felt similar frustrations with statements that feel disingenuous too. Thanks for articulating this with such a beautifully hilarious metaphor.
On first look this seems to blow the current LLaMA-based models out of the water, including the 30B ones.
Pasting what you want + URL + example JSON with no other context, and it "knows" what the URL and the JSON are for, without even telling it.
I'm not even saying it's as good as ChatGPT, but this is a tenth the size of the best LLaMA models I've seen.
Hehe, yeah, imagine saying you made a new programming language with 77% fewer lines of code than Python.
Stuff like this will make your outcomes worse for any model.
I tried a prompt of:
And it produced something that doesn't add the needed import statement, and I'm unclear why it's "defining the size of the grid".
Conditioning the model with the start of the program, as suggested upthread, is in fact one of OpenAI's best practices for prompting, called "leading words": https://help.openai.com/en/articles/6654000-best-practices-f...
AMAZING WORK.
The point is that there will be alternatives, and that will reduce prices over time, further increasing the impact of the technology.
There was a possible future where only MSFT and maybe GOOG and maybe one or two other companies had this technology and extracted massive rents.
I thought I read that. Is it based upon https://arxiv.org/abs/2211.15533 (The Stack)?
https://try.ocamlpro.com/#code/type'point'='$4'x:'int;'y':'i...