amasad · 3 years ago
Some links:

- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...

- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b

- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo

- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081

A lot about this project was surprising. We knew it was going to be good, but didn't expect it to be this good -- especially surprising was the finetuned performance boost, and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).

It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.

Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).
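
For anyone who wants to poke at it locally, here's a minimal sketch of loading the HuggingFace checkpoint and sampling a completion (the sampling settings are just illustrative; check the model card for the recommended ones):

    # Minimal sketch: load the checkpoint and sample one completion.
    # trust_remote_code is needed because the repo ships custom modeling code;
    # the sampling parameters below are illustrative, not official recommendations.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "replit/replit-code-v1-3b"
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         temperature=0.2, top_p=0.95)
    print(tokenizer.decode(out[0], skip_special_tokens=True))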

newhouseb · 3 years ago
First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.

I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).

My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.

ofirpress · 3 years ago
(I wrote ALiBi) You can read the paper here https://arxiv.org/abs/2108.12409

While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.

These findings have been confirmed by others, including by the BLOOM open source LM project.
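
For intuition, the core mechanism is tiny: instead of adding position embeddings to the token inputs, you add a fixed per-head linear penalty to the attention scores based on query-key distance. A simplified single-head sketch (not the exact implementation from the paper):

    import torch

    def alibi_attention(q, k, slope):
        # q, k: [seq_len, head_dim] for a single attention head.
        # `slope` is a fixed per-head constant (e.g. 1/2, 1/4, 1/8, ... across heads).
        seq_len = q.shape[0]
        scores = (q @ k.T) / q.shape[-1] ** 0.5            # [seq_len, seq_len]
        pos = torch.arange(seq_len)
        bias = slope * (pos[None, :] - pos[:, None])       # 0 for the current token, more negative further back
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        scores = (scores + bias).masked_fill(~causal, float("-inf"))
        return torch.softmax(scores, dim=-1)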

kir-gadjello · 3 years ago
Impressive model, thank you for releasing it under a business-friendly license!

Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.

Here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.

Hope you get to look into this!

b33j0r · 3 years ago
Thank you for releasing the weights along with the announcement. So many posts make great headlines, but then it's "weights are on their way!"

Like, why did we even get excited about those? This, though? Great work.

swyx · 3 years ago
> I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.

is that a guess or is there a source? I'm curious to read more

chaxor · 3 years ago
I don't think it's a business friendly license?
sputknick · 3 years ago
What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?
amasad · 3 years ago
Broadly, finetuning is any training done after pretraining. Most of the time it's used in the context of fitting a narrower task. In our case it was the same training objective as the pretraining, but on data meant to be more representative of what Replit users like to code. However, we were surprised by how much it boosted overall performance. Best guess: a) it's novel data, and b) the model could take even more training!!
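
In code, the loop looks roughly like this (a sketch using the HuggingFace Trainer; the dataset path and hyperparameters are placeholders, not our actual setup):

    # Sketch of continued causal-LM training from the pretrained checkpoint.
    # Dataset path and hyperparameters are placeholders, not the real setup.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    name = "replit/replit-code-v1-3b"
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
    if tokenizer.pad_token is None:          # the collator needs a pad token
        tokenizer.pad_token = tokenizer.eos_token

    raw = load_dataset("json", data_files="my_code_corpus.jsonl", split="train")
    tokenized = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                        batched=True, remove_columns=raw.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                               per_device_train_batch_size=1, learning_rate=1e-5),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
    )
    trainer.train()
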
WinLychee · 3 years ago
You can take a network and its weights that someone else trained, and use that pretrained network to train on your own data, which is likely to be a better starting point than random weights.
spenczar5 · 3 years ago
How is this code licensed? I didn't see a license in the repo. It looks interesting!
dgacmu · 3 years ago
The README indicates:

The base model checkpoint is licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.

letitgo12345 · 3 years ago
Doesn't the Stack contain HumanEval? So you're basically comparing numbers on the pretraining data.
amasad · 3 years ago
Can't find it now, but I'm pretty sure BigCode said somewhere that they explicitly looked for it and removed it. Also, the subjective measure does match up with the benchmark: our finetuned model scored +50% on HumanEval, and when using it, it felt at least that much improved.
godelski · 3 years ago
My favorite line from the HumanEval paper[0]

> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.

So to answer your question, yes, the evaluation dataset is spoiled. You can find such unique and never before seen docstrings like

> For a given list of input numbers calculate the Mean Absolute Deviation around the mean of this dataset. Mean Absolute Deviation is the absolute difference between each element and a centerpoint (mean in this case)[1]

And here's a repo I found that is 8 years old[2]. But how about a more recent one that is even closer?[3] There are plenty more examples[4] (does anyone know how to actually limit the date to prior to 2021? `pushed:<2021` doesn't work, nor does the `created` keyword. Date searching doesn't seem to work well).

In essence, we can still use this evaluation method to determine how good our model is at doing fuzzy searching. Which, mind you, is still a useful thing. But I would be careful in concluding that this means the model is good at generalizing arbitrary descriptions of code or novel pieces of code. That said, one may be able to argue that not many lines of code are actually that novel. Still, we need to be careful about our conclusions and understand the limitations of our metrics (something I am currently deeply troubled by)
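
For reference, the "solution" to that docstring is only a couple of lines, which is exactly why near-duplicates of it are all over GitHub (my own quick version, not the benchmark's reference solution):

    from typing import List

    def mean_absolute_deviation(numbers: List[float]) -> float:
        # Average absolute difference between each element and the mean.
        mean = sum(numbers) / len(numbers)
        return sum(abs(x - mean) for x in numbers) / len(numbers)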

[0] https://arxiv.org/abs/2107.03374

[1] https://github.com/openai/code-align-evals-data/blob/97446d9...

[2] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...

[3] https://github.com/danielwatson6/hate-speech-project/blob/64...

[4] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...

pera · 3 years ago
Hi there, I have two questions:

1 - Why did you choose Markdown? It seems an odd choice for training a model like this.

2 - Have you tried to train only one single PL and then benchmark it against this more general version?

amasad · 3 years ago
1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.

2- I like how portable it is as a single small model covering a lot of languages. Single-language code models are an approach that Salesforce/CodeGen took, but I believe we beat (or get very close to) their mono models on benchmarks.

runnerup · 3 years ago
They trained on https://huggingface.co/datasets/bigcode/the-stack-dedup which is a massive curated dataset accumulated from GitHub. Details are here: https://www.bigcode-project.org/docs/about/the-stack/

Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

More details from them here: https://blog.replit.com/llm-training
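
If you want to poke at the training data yourself, it can be streamed from the Hub without downloading the whole thing (a sketch; you may need to accept the dataset's terms first, and the data_dir and field names here follow my reading of the dataset card):

    # Sketch: stream one language slice of the-stack-dedup from the Hub.
    from datasets import load_dataset

    ds = load_dataset("bigcode/the-stack-dedup", data_dir="data/python",
                      split="train", streaming=True)
    for example in ds.take(3):
        print(example["lang"], len(example["content"]), "chars")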

curiousgal · 3 years ago
Did any interns help in developing this? If so are you planning on intimidating them as usual? :)

Reference: How Replit used legal threats to kill my open-source project https://intuitiveexplanations.com/tech/replit/

robertlagrant · 3 years ago
Wow. That's extremely poor behaviour if the account is accurate.
gbasin · 3 years ago
Very exciting, thanks for sharing all this
doodlesdev · 3 years ago
The model is way too small; comparing it to Codex feels disingenuous. Sure, it's 77% smaller, but it's also 77% worse. It's a cool project nonetheless.

For instance, even this simple snippet generates wrong inline completions:

   // Only return even numbers bigger than 10 from the array
   const arrayFilter = (array) =>
Replit-code-v1:

   // Only return even numbers bigger than 10 from the array
   const arrayFilter = (array) => {
     return array.filter((item) => item > 10);
   };
Gets it wrong: it returns odd numbers too, since it never checks parity.

Codeium:

   // Only return even numbers bigger than 10 from the array
   const arrayFilter = (array) => {
     return array.filter((num) => num > 10 && num % 2 === 0);
   };
ChatGPT (GPT-3.5 Turbo) - Code-only, without the rest of the completion since it's instruction-tuned:

   const arrayFilter = (array) => {
     return array.filter(num => num % 2 === 0 && num > 10);
   }
Not comparable at all. For reference if anyone wants to test I ran this through the HuggingFace space using the default parameters, ChatGPT through chat.openai.com, and Codeium through the VSCodium extension on an empty JavaScript file.

amasad · 3 years ago
Interesting. This seems like a weakness of natural language understanding. If you rephrase your prompt slightly it would get it right. Try:

  // return even numbers that are also more than 10
  const arrayFilter = (array) =>
It would do the right thing. The fine-tuned version gets your prompt right so maybe it benefited from natural language data. Will look more into it.

doodlesdev · 3 years ago
That's really interesting, indeed I can reproduce this by changing the comment. I also managed to get correct output for this sample by renaming the function.
SCLeo · 3 years ago
I agree. Maybe it interpreted it as "return the numbers that are more than 10 from the given array of even numbers."

For example, if the instruction says "return person objects that are at least 20 years old", it might be more reasonable to generate:

array.filter(item => item.age >= 20)

as opposed to

array.filter(item => (item instanceof Person) && (item.age >= 20))

SheinhardtWigCo · 3 years ago
It seems like every week someone comes out with some version of "we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"

And then you dig in, and it's always far behind in some important way.

Not hating here, I love the pace of iteration, just not the hyperbole.

barking_biscuit · 3 years ago
>"we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"

I have felt similar frustrations with statements that feel disingenuous too. Thanks for articulating this with such a beautifully hilarious metaphor.

thewataccount · 3 years ago
I need more time to compare it (the short 128-token limit in the demo is a bit rough), but:

On first look this seems to blow the current LLaMA-based models out of the water, including the 30B ones.

Pasting what you want + URL + example JSON with no other context, and it "knows" what the URL and the JSON are for, without even being told.

I'm not even saying it's as good as ChatGPT, but this is a tenth the size of the best LLaMA models I've seen.

moffkalast · 3 years ago
Yeah I tried the demo, it wrote some wrong code with comments in Chinese. I think I'll pass.

It's a pretty well accepted fact now that bigger LLM = moar better without exceptions. I'm not sure why there's a race to the bottom of who'll make the most useless model that can run everywhere.

thewataccount · 3 years ago
> It's a pretty well accepted fact now that bigger LLM = moar better without exceptions.

That's not true, the amount of training is a MAJOR factor.

See the Chinchilla paper - https://arxiv.org/abs/2203.15556

tl;dr - a "fully" trained small model can outperform a "undertrained" larger model. If you have a fixed amount of compute (budget), then you need to optimize for the largest model that you can fully train, and not simply up the parameter count.

EDIT: Also you can't necessarily compare the parameter count across model architectures*

This thing seems to outperform the finetuned 30B llama models I've seen.

johnfn · 3 years ago
> Sure it's 77% smaller, it's also 77% worse.

Hehe, yeah, imagine saying you made a new programming language with 77% fewer lines of code than Python.

Zababa · 3 years ago
Finally, an opportunity to share this https://nsl.com/papers/denial.html
swyx · 3 years ago
hi HN! back again with an exclusive deep dive with Replit's head of AI. I attended their developer day last week (https://twitter.com/swyx/status/1650989632413401089) just expecting a regular fundraise announcement and was totally shocked when they announced their own LLM and also said they would open source it. so I immediately asked them for a podcast interview and this is the result.

my favorite learning is how they are pushing the state of the art - openai’s HumanEval is the industry standard benchmark for code LLMs, but Reza kindly went above and beyond to show how they use “AmjadEval” - using coder intuition to capture human preference on what output is more helpful to coders (see screenshots https://twitter.com/swyx/status/1653791019421569024?s=20)

please AMA!

marcodiego · 3 years ago
Sorry, I have to ask this: how does this compare to ChatGPT?
runnerup · 3 years ago
It's not crucial that it beat ChatGPT this year. That's a pretty unattainable goal for a group like Replit. From the user's POV, none of the current copilots compare favorably against ChatGPT, not even Microsoft's OpenAI-powered GitHub Copilot.

What's important is that they're preparing for the future by building all the tooling/UI/UX around coding copilots. This way, when costs and feasibility of building ChatGPT-quality LLM's drop and multiple open-source models are available, Replit has the ability to immediately drop them into their production environment. They'll also have the skills and systems to finetune any new models and wring extra performance out of them.

This is more important to users than it seems at first, because the current UX of things like GitHub Copilot doesn't allow me to use their AI against my codebase the way that I want to (the way I use ChatGPT). Right now GitHub Copilot is a glorified auto-complete, but I want it to do widespread scaffolding, refactoring, and analysis across my whole codebase. Microsoft has access to LLM's that can do this through their control of OpenAI -- but Microsoft lacks the tooling/UI/UX to bring the power of ChatGPT to me as a user of VSCode/IntelliJ/PyCharm/Visual Studio.

So if Replit can find more innovative, boundary-pushing ways of integrating LLM's, they won't necessarily need the highest quality LLM's to produce a superior user experience. It's a strong signal that Replit is well-positioned for the future, when ChatGPT-like models are democratized.

Hopefully JetBrains is paying attention. They definitely have time to wait a bit more (1-2 years?), but not a lot of time. JetBrains shouldn't rely solely on the GitHub Copilot plug-in to provide their users with LLM's, because it's not clear that the user experience of that plug-in will stay competitive with the user experience that GitHub Copilot will offer directly in VSCode. The IntelliJ/PyCharm plugin may remain "just a fancy auto-complete" while VSCode gets more interactive workflows.

Future IDE's with LLM integration require novel, smart, clever UX typically invented only by very creative people.

It's also worth noting that Replit is not just trying to be an IDE -- they're also building a marketplace to buy/sell coding work, and establishing a small foothold as a niche cloud computing provider.

swyx · 3 years ago
it doesn't. replit-code-v1-3b is a code LLM, ChatGPT is an app on top of LLMs. it compares to OpenAI Codex, a small version of which is behind GitHub Copilot.
heliophobicdude · 3 years ago
Hard to compare them, actually. The thing about ChatGPT is the chat part: it was trained to interact and respond to human conversation. This is more like Copilot, with code completion based off of actual code.
FanaHOVA · 3 years ago
This was a lot of fun to record, and it's the second episode where I got an eval question wrong; I'm going to be demoted to a bot soon lol
swyx · 3 years ago
means you are human! like the rest of us
swyx · 3 years ago
we also did an interview with Varun Mohan of Codeium, which is another competing code model trained from complete scratch: https://lspace.swyx.io/p/varun-mohan#details
sashank_1509 · 3 years ago
Replit: I tried to get it to write HMM forward-algorithm code:

```
# Write a function that performs the forward algorithm of an HMM. Ideally it uses numpy vectorization,
# But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.
# def HMM_forward() :

    # Write your code here

    # Hint:
    # - Remember that HMMs are stochastic, so you will need to multiply transition matrices by observation probabilities
    # - Remember that HMMs are stochastic, so you will need to multiply emission matrices by observation probabilities
    # - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
    # - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
    # - Remember that HMMs are stochastic, so
```

It gets stuck in a repetitive loop. ChatGPT gets it right, first shot:

```

import numpy as np

def forward_algorithm(A, B, pi, obs):

    """
    Performs the forward algorithm for an HMM.

    Parameters:
        A: numpy.ndarray, shape (N, N)
            Transition matrix of the HMM, where N is the number of states.
        B: numpy.ndarray, shape (N, M)
            Emission matrix of the HMM, where M is the number of possible observations.
        pi: numpy.ndarray, shape (N,)
            Initial probability distribution over states.
        obs: numpy.ndarray, shape (T,)
            Sequence of T observations.
    
    Returns:
        alpha: numpy.ndarray, shape (T, N)
            Forward probabilities for each state at each time step.
    """

    T = obs.shape[0]
    N = A.shape[0]

    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]

    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]

    return alpha
```

OpenAI managed to do something important but extremely hard: they moved out of the DL benchmark frame and made something that is general-purpose useful. Great effort and congrats to the Replit team though; hopefully they can keep iterating on this and reach ChatGPT capabilities someday.

amasad · 3 years ago
The model is not RLHF'd or instruction-tuned. It's an inline autocomplete model, so it will get confused if you talk to it like you're talking to a person (although it is possible to finetune it that way). To get better full-function completion, try giving it the function definition and a descriptive docstring as a prompt.
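
For example, a prompt shaped like this (my illustration of the signature-plus-docstring style, riffing on the HMM example above) usually steers an autocomplete-style model much better than conversational instructions:

    import numpy as np

    def hmm_forward(A: np.ndarray, B: np.ndarray, pi: np.ndarray, obs: np.ndarray) -> np.ndarray:
        """Forward algorithm for an HMM.

        A: (N, N) transition matrix, B: (N, M) emission matrix,
        pi: (N,) initial state distribution, obs: (T,) observation indices.
        Returns the (T, N) matrix of forward probabilities.
        """
        # ...let the model complete from here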

Deleted Comment

fauxpause_ · 3 years ago
> But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.

Stuff like this will make your outcomes worse for any model.

sashank_1509 · 3 years ago
Really? My experience with GPT is that the more description I add, the better the results. I presume this is because it has a longer prompt to attend over. I think the whole idea of focusing on keywords / concise sentences is a very "search engine" paradigm; language models do better the more you describe your question.
tyingq · 3 years ago
More tools in the field is great! I tried a few things, and it's reasonable, but it does have some quirks that seem to repeat, like:

I tried a prompt of:

  # python function that returns a random integer between min and max
And it produced:

  def random_int(min, max):
      return random.randint(min, max)

  # define the size of the grid
  n = 5
It doesn't add the needed import statement, and I'm unclear why it's "defining the size of the grid".

amasad · 3 years ago
LLMs in general, but small models even more so, will keep going and generate seemingly unrelated things. On the frontend, tools like Copilot and Ghostwriter do a lot of things like using stop sequences or simply not showing completions outside a single block.

As for your prompt, the model is following it a little too closely and generating just the function. You can, however, condition it so it knows this is the start of the program, and then it will do the import, e.g.

   # python function that returns a random integer between min and max
   import
This is in fact a suggestion from OpenAI on best practices for prompting called "leading words" https://help.openai.com/en/articles/6654000-best-practices-f...

circuit10 · 3 years ago
That's because it's not following instructions like ChatGPT; it's just trying to guess what could plausibly come after what you put, like Copilot or the old GPT-3 models.
jeremyjh · 3 years ago
Isn’t ChatGPT also just generating plausible text that could be a response to an instruction?
minkzilla · 3 years ago
and imports are (almost) always at the top of your file, not next to a function like this
tyingq · 3 years ago
Based on the replies, I tried a different prompt:

  # python script that prints out an integer between min and max
And it did better. Included the import, didn't add unrelated code, but did still put the code inside a function.

radq · 3 years ago
I've had the issue of generating random code after the completion with other models as well; it's due to how the models are trained. You need to stop generating when you encounter token(s) that indicate you're done - see https://huggingface.co/replit/replit-code-v1-3b#post-process...
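
In practice the post-processing can be as simple as truncating the decoded text at the first stop sequence (a sketch; these stop strings are common choices, not necessarily the exact ones the model card recommends):

    def truncate_at_stop(text, stop_sequences=("\nclass ", "\ndef ", "\n#", "<|endoftext|>")):
        # Cut the completion at the earliest stop sequence so unrelated
        # follow-on code (like the stray "grid" snippet above) is dropped.
        cut = len(text)
        for stop in stop_sequences:
            idx = text.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return text[:cut]
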
agilob · 3 years ago
I get such unrelated statements from Copilot too; not often, but a few that I remember.
GreedClarifies · 3 years ago
This is amazing work, and bravo to the people working on RedPajama.

This is fantastic for the world: it means LLMs will not be controlled by a couple of companies extracting the associated rents.

Yes, private LLMs will likely be a couple of years ahead of 'free' alternatives, but that's OK; we want to incentivize for-profit research so long as the services become low-priced in time (and in this case, in short order).

AMAZING WORK.

laweijfmvo · 3 years ago
My first reaction was, "why is replit building LLMs," but I guess it fits their needs to have one optimized for their use. But I wonder, is this the beginning of another wave of "every company is an AI company?" Are we going to see a spike in tech hiring around AI/LLM, money starting to flow again, etc? And how many years until it all blows up and the layoffs start?
dpflan · 3 years ago
Finetuning LLMs (and models generally) is going to be a common practice. Each company is its own domain, with domain knowledge and data it can use to specialize open-sourced models, or to distill/teach its own proprietary model (home-grown, or a modification of someone else's).
m3kw9 · 3 years ago
Have you even tried it? It’s pretty bad
GreedClarifies · 3 years ago
But that's fine; it can be a year or two behind the state of the art. That's not the point.

The point is that there will be alternatives, and that will reduce the price over time, further increasing the impact of the technology.

There was a possible future where only MSFT and maybe GOOG and maybe one or two other companies had this technology and extracted massive rents.

swyx · 3 years ago
to be clear this work is not based on redpajama - though we did discuss that in the previous episode https://twitter.com/swyx/status/1648080532734087168?s=46&t=9...
GreedClarifies · 3 years ago
Oh my bad!

I thought I read that, is it based upon:

https://arxiv.org/abs/2211.15533 (The Stack) ?

waffletower · 3 years ago
No Clojure. No Julia. No Haskell. No Racket. No Scheme. No Common Lisp. No OCaml. And, as much as I despise Microsoft, No C#. No F#. No Swift. No Objective-C. No Perl. No Datalog. A glaringly lacking choice of languages.
ubertaco · 3 years ago
I fed it some OCaml and it worked, though the example was trivial:

    type point = { x: int; y : int }
    let manhattan_distance (a: point) (b: point) : int =
which it completed to

    type point = { x: int; y : int }
    let manhattan_distance (a: point) (b: point) : int =
        abs (a.x - b.x) + abs (a.y - b.y)
...which is a valid and correct OCaml definition of this function:

https://try.ocamlpro.com/#code/type'point'='$4'x:'int;'y':'i...

esjeon · 3 years ago
I hate to admit it, but Python, C, Java, and JS cover most of modern programming. Still, not supporting C# sounds like a bad idea.
Dayshine · 3 years ago
C# was available in the dataset they link, and is the most glaring omission by global usage...
mclide · 3 years ago
Despite the lack of examples, it still completes trivial Clojure like "(defn connect [" and other Lisp syntax like "(define (hello", which is promising for further refinement training on Lisp languages.
ebiester · 3 years ago
I'm sure that has to do with the dataset available to them.
runnerup · 3 years ago
Which is a deduplicated version of this: https://www.bigcode-project.org/docs/about/the-stack/

And probably, yes. While it contains 358 programming languages, obviously there's a long tail after the 20 most-represented languages. Some people might not expect, without thinking about it for a bit, that many of the most-represented "languages" are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

Also note that it won't be able to parse natural language nearly as well without additionally being trained on something like the LAION dataset, so this version will be more of an autocomplete like Copilot rather than something which can manifest high level business logic from whole cloth like ChatGPT.

sitkack · 3 years ago
You could take it and finetune it on a bunch of Lisps; it would probably cost on the order of 50-500 to do that.
swyx · 3 years ago
if anyone from MosaicML is reading this, i’d love a guide on how to do exactly this!
chaxor · 3 years ago
It's a bit hard to believe that the system is decent at producing code which captures complex ideas and higher-level structure when the tokens/param value is >30 (it's ~200 here?). The 'good' models (meaning ones with lots of 'knowledge' or 'memorization' of the dataset) typically tend to be around 2 tokens/param, and models with decent language generation but less knowledge/memorization are around 30 tokens/param. Perhaps the domain allows for this, but given that the linguistic interface on the input is still needed... it's hard to believe.
gnramires · 3 years ago
Tokens/param shouldn't matter more than the total training FLOPs, I believe. Clearly, if we train at your claimed 'ideal' 2 tokens/param on a very small dataset (not many tokens in the first place), the model wouldn't have enough data to properly learn the relevant languages. Once there is enough data, it becomes a question of model capacity (does it have enough degrees of freedom to support the computational structures needed?).

I believe the overparametrization mostly helps with generalization and reducing overfitting; at 2 tokens/param there are many more degrees of freedom than structures that can be learned, from what I can tell (the extra capacity just provides good breathing room for internal structures). But if your model has enough capacity, and you can find a good enough training method (and you have enough data to learn the task), then you should be able to succeed at arbitrarily low tokens/param, which is good to keep in mind for making efficient models.
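
The usual back-of-the-envelope for total training compute is roughly 6 x parameters x tokens, which makes the trade-off easy to eyeball (a rule-of-thumb estimate with made-up example sizes, ignoring attention overhead):

    def approx_training_flops(n_params: float, n_tokens: float) -> float:
        # ~6 FLOPs per parameter per token (forward ~2ND, backward ~4ND).
        return 6.0 * n_params * n_tokens

    # Very different tokens/param ratios can still mean the same total compute:
    print(f"{approx_training_flops(3e9, 500e9):.2e}")   # small model, ~167 tokens/param
    print(f"{approx_training_flops(30e9, 50e9):.2e}")   # 10x model, ~1.7 tokens/param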

swyx · 3 years ago
this kind of critical thinking is exactly what replit is going to need for their stated goal of doing whole-app generation. right now they only test it on AmjadEval. you… might wanna consider joining them to work on it?
chaxor · 3 years ago
I'm not sure noticing tokens/param ratios or simplicial modeling properties requires much critical thought; perhaps it's just a standard first thought for anyone handed an LM now. I've worked tangentially to NLP for about 7 years in academia, but most of my work has been focused on a lesser-known field of mathematics applied either to the outputs or to the NNs themselves, as well as bioinformatics. As such, my expertise may not be as refined as the real players in the field such as Glaese, Finn, Velockovic, etc., but I typically try to keep up with the actual key advancements in the field (usually the stuff few people notice). This area takes far too much compute capability for many people to actually become experts in it, so much of my interest has been less on large LMs and more on algorithms. But I suppose I agree that it is frustrating to see how little knowledge many of the hype-filled crowds piling into this area actually possess. (Not calling anyone specifically out in this thread, just in general across the internet.)
EvgeniyZh · 3 years ago
Are you saying the less you train the model the better it is? I'm confused
swyx · 3 years ago
i believe GP is referencing the Kaplan and Chinchilla scaling laws. we reference those in the podcast but i’m not sure if some deeper insight is being hinted at here where different scaling laws apply for different domains/purposes