Posted by u/mpoon 3 years ago
Show HN: GPT Repo Loader – load entire code repos into GPT prompts (github.com/mpoon/gpt-repo...)
I was getting tired of copy/pasting reams of code into GPT-4 to give it context before I asked it to help me, so I started this small tool. In a nutshell, gpt-repository-loader will spit out file paths and file contents in a prompt-friendly format. You can also use .gptignore to ignore files/folders that are irrelevant to your prompt.

gpt-repository-loader as-is works pretty well in helping me achieve better responses. Eventually, I thought it would be cute to load itself into GPT-4 and have GPT-4 improve it. I was honestly surprised by PR#17. GPT-4 was able to write a valid example repo and an expected output, and throw in a small curveball by adjusting .gptignore. I did tell GPT the output file format in two places: 1.) in the preamble when I prompted it to make a PR for issue #16 and 2.) as a string in gpt_repository_loader.py, both of which are indirect ways to infer how to build a functional test. However, I don't think I explained to GPT in English anywhere how .gptignore works at all!
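The core loop of such a loader can be sketched in a few lines of Python. This is a simplified illustration, not the repo's actual code: the `----` separator line is an assumption, though the final `--END--` marker is part of the described format.

```python
import fnmatch
import os

def load_repo(repo_path, ignore_patterns=()):
    """Concatenate a repo into one prompt-friendly string.

    Hypothetical simplification of gpt_repository_loader.py: each file
    is emitted as a separator line, its relative path, then its
    contents, ending with an --END-- marker.
    """
    parts = []
    for root, _dirs, files in os.walk(repo_path):
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, repo_path)
            # .gptignore-style filtering via glob patterns
            if any(fnmatch.fnmatch(rel, pat) for pat in ignore_patterns):
                continue
            with open(path, "r", errors="ignore") as f:
                parts.append("----\n%s\n%s\n" % (rel, f.read()))
    parts.append("--END--")
    return "".join(parts)
```

The `.gptignore` handling here is just `fnmatch` globbing; the real tool may match patterns differently.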

I wonder how far GPT-4 can take this repo. Here is the process I'm following for developing:

- Open an issue describing the improvement to make

- Construct a prompt - start with using gpt_repository_loader.py on this repo to generate the repository context, then append the text of the opened issue after the --END-- line.

- Try not to edit any code GPT-4 generates. If there is something wrong, continue to prompt GPT to fix whatever it is.

- Create a feature branch on the issue and create a pull request based on GPT's response.

- Have a maintainer review, approve, and merge.

I am going to try to automate the steps above as much as possible. Really curious how tight the feedback loop will eventually get before something breaks!
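The prompt-assembly step above can be sketched like this; the instruction wording is an illustrative assumption, not my exact preamble.

```python
def build_prompt(repo_context, issue_text):
    """Append a GitHub issue after the repository context's --END--
    line, per the workflow above. The connective sentence below is a
    hypothetical stand-in for the real preamble."""
    repo_context = repo_context.rstrip()
    if not repo_context.endswith("--END--"):
        repo_context += "\n--END--"
    return (repo_context
            + "\n\nThe following is an open issue on this repository. "
              "Please respond with a pull request that resolves it.\n\n"
            + issue_text)
```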

ftufek · 3 years ago
This is awesome, can't wait to get API access to the 32k-token model. Rather than this approach of just converting the whole repo to a text file, what I'm thinking is you can let the model decide which files are most relevant.

The initial prompt would be, "person wants to do x, here are the file list of this repo: ...., give me a list of files that you'd want to edit, create or delete" -> take the list, try to fit the contents of them into 32k tokens and re-prompt with "user is trying to achieve x, here's the most relevant files with their contents:..., give me a git commit in the style of git patch/diff output". From playing around with it today, I think this approach would work rather well and can be like a huge step up from AI line autocompletion.
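The two-stage flow could be sketched like this. The prompt wording is illustrative, and the actual model call (whatever API client you use) is left as a stand-in; a character budget crudely approximates the 32k-token limit.

```python
def select_files_prompt(task, file_list):
    """Stage 1: ask the model which files it needs for the task."""
    return ("A person wants to do the following: %s\n"
            "Here is the file list of this repo:\n%s\n"
            "Reply with only the files you'd want to edit, create or delete."
            % (task, "\n".join(file_list)))

def patch_prompt(task, contents, budget_chars=100_000):
    """Stage 2: re-prompt with the selected files' contents, truncated
    to a rough character budget standing in for the token limit."""
    body = "\n\n".join("### %s\n%s" % (path, text)
                       for path, text in contents)
    return ("A user is trying to achieve: %s\n"
            "Here are the most relevant files with their contents:\n%s\n"
            "Reply with a git commit in the style of git diff output."
            % (task, body[:budget_chars]))
```

You would feed `select_files_prompt` to the model, parse the file names out of its reply, read those files, and pass them to `patch_prompt` for the second call.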

wantsanagent · 3 years ago
Please see the following repos for tools in this area:

https://github.com/jerryjliu/llama_index

and/or

https://github.com/hwchase17/langchain

groestl · 3 years ago
Maybe someone can correct me, but my understanding is that you would calculate the embeddings of code chunks, and the embedding of the prompt, and take those chunks that are most similar to the embedding of the prompt as context.

Edit: This, btw, is also the reason why I think that this here popped up on the hackernews frontpage a short while ago: https://github.com/pgvector/pgvector
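The retrieval step described here can be sketched with plain cosine similarity; a real system would get the vectors from an embedding model and store them in something like pgvector, so the toy tuples below are stand-ins.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_chunks(prompt_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the prompt
    embedding; those chunks become the context for the LLM call."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(prompt_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```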

michaelmior · 3 years ago
This sounds like a reasonable start. Eventually we need to get to the point where we can expose an API for models to request additional information on their own.
alooPotato · 3 years ago
It's just so slow for the autocompletion use case to do it like that. Ideally, you're never chaining serial requests to the LLM. Even if you stuff all the data into a single prompt, execution time seems to be superlinear in the number of tokens, which again gets very slow.
ftufek · 3 years ago
Yeah I agree it's too slow for autocompletion at the moment, but this would be for full feature implementations, not just autocomplete. For example, if I have a repo I want to add a table and REST API implementation to, it can do this: https://imgur.com/a/mIJvaJr (ignore the formatting errors in the UI; somehow parts of it show up as code and others don't, but the API wouldn't have this issue, especially since you can use the system message to enforce the output format).

I'm happy to wait even 30-60 seconds for this which I can easily evaluate, criticize (and the model will correct it) and then proceed to just patch and move on. I think the results from this will be much better with the 32k model, but remains to be seen.

visarga · 3 years ago
Working with GPT becomes like coding in plain English.
marginalia_nu · 3 years ago
There's a reason we don't code in plain English though. Natural language has ambiguities. This is the reason we invented programming languages.

It's best illustrated by the old joke:

  A programmer's wife told him "Go to the store and buy milk and if they have eggs, get a dozen." He came back a while later with 12 cartons of milk.

A good chunk of all bugs in software is down to the requirements being insufficiently well specified. Further, many bugs are the discovery of new requirements when informal specification encounters reality.

"Read from standard input into this byte array" doesn't specify what to do when the input exceeds the byte array.

When you overflow the buffer, you get a "well, obviously you're supposed to not do that"... but that wasn't stated at all.

When the function keeps going after a newline or a null byte or whatever, there's another "well obviously you're supposed to stop at those points". That was also not specified.

and so on.

At the point you're specifying all these cases and what to do when, it's so specific and stilted, you might as well be using a programming language.

victorbjorklund · 3 years ago
And we are back with Cobol haha full circle
wongarsu · 3 years ago
You should probably also include all relevant imports: in C/C++, add all non-standard headers referenced by those files; in other languages, simulate the import system, maybe pruning imported files to just the important parts (type definitions, etc.).
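For the C/C++ case, pulling out the project-local headers can be sketched with a regex. The quote-vs-angle-bracket heuristic below treats `#include "..."` as project-local, which is a strong convention but not a guarantee.

```python
import re

# Matches: #include "util.h"  (quoted form, project-local by convention)
# Skips:   #include <stdio.h> (angle-bracket form, system headers)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)

def local_headers(source):
    """Return headers included with quotes from a C/C++ source string."""
    return INCLUDE_RE.findall(source)
```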
capableweb · 3 years ago
AFAIK, only the older models allow fine-tuning. I'm not sure GPT-4 will let you create your own fine-tuned model, so with the API it will basically work the same as with the chat GUI.
wokwokwok · 3 years ago
Just remember the API charge is 6c per input token [1]. If you push 32k input tokens in, you're looking at $2000 per API call just as input.

You... might wanna consider a self hosted alternative for that use case, or at least do like, a `| wc` to get an idea of what you're potentially sending before calling the api.

[1] - https://help.openai.com/en/articles/7127956-how-much-does-gp...

dangond · 3 years ago
6c per thousand tokens, so $2 per maxed out API call
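The arithmetic, assuming the $0.06 per 1,000 input tokens figure for the 32k-context model (pricing at the time; check current rates before relying on this):

```python
def prompt_cost_usd(tokens, usd_per_1k=0.06):
    """Input-side cost at $0.06 per 1,000 tokens."""
    return tokens / 1000 * usd_per_1k

# A maxed-out 32k-token prompt costs about $1.92, i.e. roughly $2,
# not $2000 (which would be the price at 6 cents per *token*).
```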

Deleted Comment

xwdv · 3 years ago
Eventually you get to a point where the AI simply doesn’t know how to do something correctly. It can’t fix a certain bug, or implement a feature correctly. At this point you are left trying to do more and more prompt crafting… or you can just fix the problem yourself. If you can’t do it yourself, you’re screwed.

I wonder if the future will just be software hobbled together with shitty AI code that no one understands, with long loops and deep call stacks and abstractions on top of abstractions, while tech priests take a prompt-and-pray approach to eventually building something that does kind of what they want.

Or to hell with priests! Build some temple where users themselves can come leave prompts for the general AI to hear and maybe put out a fix for some app running on their tablet devices.

Bjartr · 3 years ago
> I wonder if the future will just be software hobbled together with shitty AI code that no one understands, with long loops and deep call stacks and abstractions on top of abstractions

There's plenty of software out there that fits this description if you just remove "AI" from the statement. There's nothing new about bad codebases. Now it just costs pennies and is written in seconds instead of thousands paid to an outsourcing middleman firm that takes weeks or months to turn it around.

greesil · 3 years ago
At least the shitty software might have unit tests

Deleted Comment

tobiasSoftware · 3 years ago
I don't see programming going away for this reason. Think about it, if you have to carefully describe what you want to do to an AI - you are just writing a program. Only a program is deterministic and will do what you tell it to, whereas an AI may or may not.

The future that I see, coding and AI are divided into two camps. The one is what we would call "script kiddies" today - people who don't understand how to write software, but know enough to ask the right questions and bodge what they get together into something that mostly works. The other camp would be programmers who are similar to programmers today, but use AI to write boilerplate for them, as well as replace Stack Overflow.

osigurdson · 3 years ago
I'm already at the point where I get frustrated when GPT writes some incorrect code - despite it saving me enormous amounts of time. The appetite for productivity seems to be insatiable. I want to be able to create a million lines of code per month by myself.
antibasilisk · 3 years ago
Between GPT-3 and GPT-4, the precision required of prompts decreased significantly. In theory it should reach the point where a project-manager type could describe what is needed and it would simply do it. The main thing missing is that GPT-4 basically never responds to questions with requests for clarification; otherwise, a whole team of developers could be reduced to just one proofreader.
hackerlight · 3 years ago
The problem is the unfounded assumption that the code will be shitty 10 years from now. You really want to make that bet given the extreme velocity of recent progress? My expectation is that the code will actually be good.
layer8 · 3 years ago
> software hobbled together with shitty AI code that no one understands

That’s my fear as well. Software may just get shittier overall (aka “good enough”), and in higher volumes, due to it taking less time to crank out using AI.

ChatGTP · 3 years ago
I think it's going to be the end of libraries. Why would people bother with libraries when every service can be churned out in 5 seconds with ChatGPT?

We're going to witness the greatest copy-pasta in history and find out how that goes.

amq · 3 years ago
Most of the software will be GPT itself.
thingification · 3 years ago
Yes, aren't we under-focusing a bit, right now, on the goals that in the past were achieved by writing code but will in the future be achieved by LLMs themselves (and their replacements)?

For example, maybe we write some code now with the goal of helping a customer service person do their job. But we all know that plenty of people are trying to replace customer service people with LLMs, not use LLMs to write tools to help customer service people.

I see that the LLM still needs to know what's going on with the customer account, and maybe for a long time that takes the form of conventional APIs. But surely something is going to change here?

jerpint · 3 years ago
“This repo is GPL-v3 licensed. Rewrite it while preserving its main functionality”
EMIRELADERO · 3 years ago
This is already legal without AI. Copyright protects only expression, not ideas, systems or methods. This is why directly reverse-engineering a proprietary binary to extract the algorithms and systems is legal.
zarzavat · 3 years ago
Indeed, but it’s a much more legally dubious proposition when it comes to entire repos. A repo has more potentially creative structure for copyright to attach to. For example the class graph, or the filesystem layout are creative decisions that could potentially be protected. Current LLMs are nowhere near powerful enough to reimplement an entire repo without violating copyright.

For an individual function I can totally believe GPT4 could strip creative expression from it today. For example you could ask it to give a detailed description of a function in English, and then feed that English description back in (in a new session) and ask it to generate a code based upon the description.

cornholio · 3 years ago
> Copyright protects only expression, not ideas, systems or methods

Copyright is a law agreed on by humans in a social contract created to protect humans and further their interests in a 'fair' manner. There is no inalienable right to copyright, no universal law that requires it; it's not an emergent property of intelligence that mechanically applies to artificial entities.

So while the current copyright laws could be interpreted in the way you suggest for the time being, they are clearly written without any notion of AI, and can and should be revised to incorporate the new state of the world; you can bet creators will push hard in that direction. It's pretty clear that the mechanical transformation of a human body of work for the sole purpose of stripping it of copyright is a violation of the spirit of copyright law *.

*( as long as that machine can't also generate a similar work from scratch, in which case the point becomes moot. But we are far, far, from that point)

PartiallyTyped · 3 years ago
For anyone interested, this is called "clean room design"[1]. Unfortunately it doesn't protect against patents, but it does against copyright.

[1] https://en.wikipedia.org/wiki/Clean_room_design

itsnotlupus · 3 years ago
New GPT-4-powered business model just dropped.
teaearlgraycold · 3 years ago
Nice!
DaiPlusPlus · 3 years ago
Rather than prompting GPT into implementing a solution, can we prompt it to try to preemptively find issues with the codebase or missing-functionality?

Also, do we know what languages GPT-4 "understands" at a sufficient level? What knowledge does it have of post-2021 language features, like in C23?

ftufek · 3 years ago
It has no post-2021 knowledge, but while playing with it, I found that you can just paste in the documentation (no need to even format it) and it'll just "learn" it. For example, safetensors apparently wasn't available back then; I just copied the docs into it and was able to get it to write pretty good pytorch code that incorporates safetensors.
kanyethegreat · 3 years ago
i just asked it about safetensors today! also, got a response that amounted to "i don't know what that is. i'm guessing it's X"
textninja · 3 years ago
I imagine few shot learning would kick in for most new language features. A feature may be new to a particular language, but is it really new?
abecedarius · 3 years ago
I tried a new Lisp dialect on it, my own hobby language. It could cope well given explanations initially, but with some degradation after a while. The full transcript went to 84kB, so it must be doing some kind of intelligent summarization behind the scenes to stay as coherent as it did, right? (The standard context window is supposed to be 8k tokens.)

(https://gist.github.com/darius/b463c7089358fe138a6c29286fe2d... paste in painful-to-read format if anyone's really curious. In three parts: intro to language; I ask it to code symbolic differentiation; then a metacircular interpreter.)

adltereturn · 3 years ago
I am skeptical about using the method of generating a large amount of repository data and sending it, because there may be too many files in the repository. I think a better approach might be for OpenAI to open an interface for transferring GIT repositories, and then let OpenAI analyze the repository data, which is similar to what [chatpdf](https://www.chatpdf.com/) is doing.
EGreg · 3 years ago
How much text can you feed GPT-4?

Our codebase is 1 million lines of code.

Can we feed the documentation to it? What are the limits?

Is it possible to train it on our data without doing prompt engineering? How?

Otherwise are we supposed to use embeddings? Can someone explain how these all work and the tradeoffs?

lukasb · 3 years ago
I've been wondering if you could use something like llama-index's tree summarization, but modified to be aware of inter-module dependencies: https://gpt-index.readthedocs.io/en/latest/guides/index_guid...
mpoon · 3 years ago
I'm waiting on my GPT-4 API access so I can use gpt-4-32k which maybe can soak up 10k LOC?

Clearly this will break eventually, but I am playing around with some ideas to extend how much context I can give it. One is to do something like base64 encode file contents. I've seen some early success that GPT-4 knows how to decode it, so that'll allow me to stuff more characters into it. I'm also hoping that with the use of .gptignore, I can just selectively give the files I think are relevant for whatever prompt I'm writing.

Bjartr · 3 years ago
> GPT-4 knows how to decode it

I wonder if you could teach it to understand a binary encoding using the raw bytestream, feed it compressed text, and just tell it to decompress it first.

dc-programmer · 3 years ago
Base64 encoding increases the size of text by a factor of 4/3. Like the other commenter asked, I wonder if another encoding could work.
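The 4/3 blow-up is easy to check, which suggests base64-encoding file contents spends characters rather than saving them (whether it saves *tokens* depends entirely on the tokenizer):

```python
import base64

def b64_overhead(data: bytes) -> float:
    """Ratio of base64-encoded length to raw length; approaches 4/3
    (plus padding) because every 3 input bytes become 4 output chars."""
    return len(base64.b64encode(data)) / len(data)
```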
WXLCKNO · 3 years ago
Have multiple instances of Gpt4 with different parts of the codebase interact with each other to write the whole thing.

Probably doesn't work this way lol

blurbleblurble · 3 years ago
32k tokens is the limit, so you won't be able to load the whole thing into the context.
nico · 3 years ago
For an alternative, you can use LangChain.

Unfortunately GPT is not yet aware of what LangChain is or how it works, and the docs are too long to feed the whole thing to GPT.

But you can still ask it to figure something out for you.

For example: “write pseudo-code that can read documents in chunks of 800 tokens at a time, then for each chunk create a prompt for GPT to summarize the chunk, then save the responses per document and finally aggregate+summarize all the responses per document”

Basically a kind of recursive map/reduce process to get, process and aggregate GPT responses about the data.

LangChain provides tooling to do the above and even allows the model to use tools, like search or other actions.
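The recursive map/reduce idea can be sketched like this; `summarize` is a stand-in for an LLM call, and the 800-token chunking is approximated by word count.

```python
def chunk(words, size=800):
    """Split a token (here: word) list into fixed-size chunks."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def map_reduce_summary(text, summarize, size=800):
    """Summarize each chunk, then recursively summarize the summaries
    until everything fits in a single chunk (the "reduce" passes)."""
    words = text.split()
    while len(words) > size:
        pieces = [summarize(" ".join(c)) for c in chunk(words, size)]
        words = " ".join(pieces).split()
    return summarize(" ".join(words))
```

In practice each `summarize` call would be a prompt like "summarize the following chunk: ..." sent to the model.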

jerpint · 3 years ago
GPT-4 is currently limited to 8k tokens, which is about 6,000 words.

You can use our repo (which we are currently updating to include QuickStart tutorials, coming in the next few days) to do embedding retrieval and query

www.GitHub.com/Jerpint/buster

freezed88 · 3 years ago
This is what llama-index was designed for! https://gpt-index.readthedocs.io/en/latest/ Would love to incorporate this Github repo loader into LlamaHub
waynenilsen · 3 years ago
Fully automated junior SWE but on hyper speed. Natural next step.
layer8 · 3 years ago
How will we grow new senior SWEs in the future?
waynenilsen · 3 years ago
Software engineering will never be the same. The LLM will teach its users about programming. This is the worst LLM tech will ever be. That is an incredible statement. The rate of error will decrease to near zero or at the very least significantly better than human. Universities will resist at first but the new tools that will emerge will be core curriculum at university. Just as I now don't use the pumping lemma day to day, programmers of the future will not write code. They will primarily review and eventually AI systems will adversarially review and programmers will do final review.

All programmers will become translators from product vision to architecture implementation via guided code review. Eventually this gap will also be closed. Product will say: Make a website that aggregates powerlifting meet dates and keeps them up to date. Deploy it. Use my card on file. Don't spend more than $100/month. The AI will execute the plan.

Programmers will come in when product can't figure out what's wrong with the system.

stubybubs · 3 years ago
You must kill one in hand to hand combat before you can take their place.
ajmurmann · 3 years ago
Given how quickly everything around generative AI has been evolving, would your money be on a new junior SWE becoming a senior SWE first or LLM tooling gaining senior SWE capabilities first?
exit · 3 years ago
just start new instances from the base image or useful checkpoints

"The Age of Em" by Robin Hanson thinks through a lot of this in great depth

WXLCKNO · 3 years ago
In a pod.
swyx · 3 years ago
61 LOC for implementation, 42 LOC for tests.

this repo currently has more HN upvotes than LOC.

very high leverage code!

wahnfrieden · 3 years ago
The code does less than the title appears to claim - it's a simple concatenation of files into one text file. Luckily GPT itself is high leverage and that's all you need.
jeremy_k · 3 years ago
https://github.com/mpoon/gpt-repository-loader/pull/17/ If you look at this PR, he had ChatGPT write the tests for him.

He wrote the issue on https://github.com/mpoon/gpt-repository-loader/issues/16 and summarized https://github.com/mpoon/gpt-repository-loader/discussions/1...

"Open an issue describing the improvement to make Construct a prompt - start with using gpt_repository_loader.py on this repo to generate the repository context, then append the text of the opened issue after the --END-- line."

Feels like it needs to add a little Github client to be able to automatically append the text of issues at the end of the output. I'm sure ChatGPT can write a Github client in Python no problem.

VadimPR · 3 years ago
I've got a hunch that AI will be able to put Hyrum's Law (https://www.hyrumslaw.com) to good use in the future: given an application, generate unit tests for all documented and undocumented behaviours of the system. Do all the refactoring you need afterwards and you'll have a large safety net backing you up. With refactoring complete, regenerate unit tests for all new behaviours of the system.