Posted by u/grbsh a year ago
Show HN: Repogather – copy relevant files to clipboard for LLM coding workflows
Hey HN, I wanted to share a simple command line tool I made that has sped up and simplified my LLM assisted coding workflow. Whenever possible, I’ve been trying to use Claude as a first pass when implementing new features / changes. But I found that depending on the type of change I was making, I was spending a lot of thought finding and deciding which source files should be included in the prompt. The need to copy/paste each file individually also becomes a mild annoyance.

First, I implemented `repogather --all`, which unintelligently copies all source files in your repository to the clipboard (delimited by their relative filepaths). To my surprise, for less complex repositories, this alone is often completely workable for Claude — much better than pasting in just the few files you are looking to update. But I never would have done it if I had to copy/paste everything individually. 200k is quite a lot of tokens!
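The post doesn't show repogather's internals, but the `--all` behavior is roughly "walk the repo, concatenate every text file under a path header." A minimal sketch (the `--- path ---` delimiter is my own placeholder, not necessarily repogather's format; piping the result to `pbcopy` or `pyperclip` is left out):

```python
from pathlib import Path

def gather_all(root: str) -> str:
    """Concatenate every readable text file under root,
    each preceded by its relative filepath as a delimiter."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        chunks.append(f"--- {path.relative_to(root)} ---\n{text}")
    return "\n\n".join(chunks)
```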

But as soon as the repository grows to a certain complexity level (even if it is under the input token limit), I’ve found that Claude can get confused by different unrelated parts / concepts across the code. It performs much better if you make an attempt to exclude logic that is irrelevant to your current change. So I implemented `repogather "<query here>"`, e.g. `repogather "only files related to authentication"`. This uses gpt-4o-mini with structured outputs to provide a relevance score for each source file (with automatic exclusions for .gitignore patterns, tests, configuration, and other manual exclusions with `--exclude <pattern>`).
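repogather's actual prompt and output schema aren't shown in the post; a sketch of the general shape of such a call (the message wording, the 0-10 scale, and the JSON schema are my own assumptions, and the actual gpt-4o-mini request is omitted):

```python
import json

def build_messages(query: str, files: dict) -> list:
    """Build a chat prompt asking the model to score each file's
    relevance to the query on a 0-10 scale, replying in JSON."""
    listing = "\n\n".join(f"### {path}\n{text[:2000]}" for path, text in files.items())
    system = (
        "You score source files for relevance to a query. "
        'Reply with JSON: {"scores": [{"path": "...", "score": 0-10}, ...]}'
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Query: {query}\n\n{listing}"},
    ]

def parse_scores(reply: str, threshold: int = 5) -> list:
    """Keep the paths whose relevance score meets the threshold."""
    scores = json.loads(reply)["scores"]
    return [s["path"] for s in scores if s["score"] >= threshold]
```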

gpt-4o-mini is so cheap and fast, that for my ~8 dev startup’s repo, it takes under 5 seconds and costs 3-4 cents (with appropriate exclusions). Plus, you get to watch the output stream while you wait which always feels fun.

The retrieval isn’t always perfect the first time — but it is fast, which allows you to see what files it returned, and iterate quickly on your command. I’ve found this to be much more satisfying than embedding-search based solutions I’ve used, which seem to fail in pretty opaque ways.

https://github.com/gr-b/repogather

Let me know if it is useful to you! Always love to talk about how to better integrate LLMs into coding workflows.

faangguyindia · a year ago
I usually only edit 1 function using an LLM on old code bases.

On greenfield projects, I ask Claude Sonnet to write all the functions and their signatures, with return values etc.

Then I've a script which sends these signatures to Google Flash, which writes all the functions for me.

All this happens in parallel.

I've found if you limit the scope, Google Flash writes the best code and it's ultra fast and cheap.
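The commenter's script isn't shared; a minimal sketch of the fan-out step, where `generate_body` stands in for whatever actually calls Gemini Flash (hypothetical, supply your own client):

```python
from concurrent.futures import ThreadPoolExecutor

def fill_bodies(signatures: list, generate_body) -> dict:
    """Send each function signature to the fast model concurrently.
    generate_body(signature) -> implementation is a stand-in for the
    real Gemini Flash call."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        bodies = list(pool.map(generate_body, signatures))
    return dict(zip(signatures, bodies))
```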

grbsh · a year ago
Interesting - isn't Google Flash worse at coding than Sonnet 3.5? I subscribe to Claude for $20/m, but even if the API were free, I'd still want to use the Claude interface for flexibility, artifacts and just understandability, which is why I don't use available coding assistants like Plandex or Aider.

What if you need to iterate on the functions it gives? Do you just start over with a different prompt, or do you have the ability to do a refinement with Google Flash on existing functions?

faangguyindia · a year ago
Claude Sonnet is a more creative coder.

That's why Gemini Flash might appear dumb next to Sonnet. But who writes the dumb functions better, the ones guaranteed to keep working in production for a long time? Gemini.

But Sonnet makes silly mistakes: even when I feed it requirements.txt, it still uses methods which either do not exist or used to exist but don't anymore.

Gemini Flash isn't as creative.

So basically, we use Sonnet for the high-level programming and Flash for the low-level (writing functions which are guaranteed to be correct and clean, no black magic).

The problem with Sonnet is that it's slow. Sometimes you'll be stuck in a loop where it suggests something, removes it when it encounters errors, then suggests the very same thing you tried before.

I am using Claude Sonnet via Cursor.

>What if you need to iterate on the functions it gives?

I can do it via Aider and even modify the prompt it sends to Gemini Flash.

JackYoustra · a year ago
Do you have the script?
mrtesthah · a year ago
This symbolic link broke it:

srtp -> .

  File "repogather/file_filter.py", line 170, in process_directory
    if item.is_file():
       ^^^^^^^^^^^^^^
OSError: [Errno 62] Too many levels of symbolic links: 'submodules/externals/srtp/include/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp'

grbsh · a year ago
Thanks for letting me know - I’ll make sure it can support (transitively) circular symlinks soon.
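One common guard against this (not necessarily what repogather will ship) is to follow symlinks while pruning any directory whose resolved path has already been visited, which breaks cycles like `srtp -> .`:

```python
import os

def iter_files(root: str):
    """Walk root following symlinks, but skip any directory whose
    resolved (real) path was already visited -- this breaks
    self-referential symlink cycles instead of recursing forever."""
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        real = os.path.realpath(dirpath)
        if real in seen:
            dirnames[:] = []  # prune: don't descend into the cycle again
            continue
        seen.add(real)
        for name in filenames:
            yield os.path.join(dirpath, name)
```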
reacharavindh · a year ago
Do you literally paste a wall of text (source code of the filtered whole repo) into the prompt and ask the LLM to give you a diff patch as an answer to your question?

Example,

Here is my whole project, now implement user authentication with plain username/password?

grbsh · a year ago
Yes! And I fought the urge to do this for so long, I think because it _feels_ wasteful for some reason? But Claude handles it like a champ*, and gets me significantly better and easier results than if I manually pasted a file in and described the rest of the context it needs by hand.

* Until the repository gets more complicated, which is why we need the intelligent relevance filtering features of repogather, e.g. `repogather "Only files related to authentication and avatar uploads"`

ukuina · a year ago
Yes? I mean, it works for small projects.
punkpeye · a year ago
Yes
reidbarber · a year ago
Nice! I built something similar, but in the browser with drag-and-drop at https://files2prompt.com

It doesn’t have all the fancy LLM integration though.

fellowniusmonk · a year ago
This looks very cool for complex queries!

If your codebase is structured in a very modular way, then this one-liner mostly just works:

find . -type f -exec echo {} \; -exec cat {} \; | pbcopy

grbsh · a year ago
I like this! I originally started with something similar (but this one is much cleaner!), but then wanted to add optional exclusions (like .gitignore, tests, configurations).

Would it be okay if I include this one liner in the readme (with credit) as an alternative?
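A sketch of what pattern-based exclusions on top of that one-liner might look like (the default patterns below are illustrative, not repogather's actual defaults):

```python
from fnmatch import fnmatch
from pathlib import Path

# Illustrative defaults -- repogather's real exclusion set may differ.
DEFAULT_EXCLUDES = ["*.pyc", "__pycache__/*", "node_modules/*", "test_*", "*.lock"]

def is_excluded(rel_path: str, patterns=DEFAULT_EXCLUDES) -> bool:
    """True if the relative path (or its basename) matches any pattern,
    in the spirit of .gitignore-style filtering."""
    name = Path(rel_path).name
    return any(fnmatch(rel_path, p) or fnmatch(name, p) for p in patterns)
```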

fellowniusmonk · a year ago
Absolutely!
smcleod · a year ago
There are so many of these popping up! Here's mine - https://github.com/sammcj/ingest
jondwillis · a year ago
In this thread: nobody using Cursor, embedding documentation, using various RAG techniques…
grbsh · a year ago
Cursor doesn’t fit into everyone’s workflow — I subscribe to it, but I’ve found myself preferring the Claude UI for various reasons.

Part of it is that I actually get better results using repogather + Claude UI for asking questions about my code than I get with Cursor’s chat. I suspect the index it creates on my codebase just isn’t very good, and it’s opaque to me.

ukuina · a year ago
It's fascinating to see how different frameworks are dealing with the problem of populating context correctly. Aider, for example, asks users to manually add files to context. Claude Dev attempts to grep files based on LLM intent. And Continue.dev uses vector embeddings to find relevant chunks and files.

I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.

grbsh · a year ago
I've been frustrated with embedding search approaches, because when they fail, they fail opaquely -- I don't know how to iterate on my query in order to get close to what I expected. In contrast, since repogather merely wraps your query in a simple prompt, it's easier to intuit what went wrong, if the results weren't as you expected.

> I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.

I've been extremely interested in this question! Will be interesting to see how things develop, but I suspect that relevance filtering is not as difficult as coding, so small, cheap LLMs will make the former a solved, inexpensive problem, while we will continue to build larger and more expensive LLMs to solve the latter.

That said, you can buy a lot of tokens for $150k, so this could be short sighted.

doctoboggan · a year ago
I am really happy with how Aider does it, as it feels like a happy medium. The state of LLMs these days means you really have to break the problem down into digestible parts, and if you are doing that, it's not much more work to specify the files that need to be edited in your request. Aider can also prompt you to add a file if it thinks that file needs to be edited.
taneq · a year ago
I haven't looked into this but do any of them use modern IDE code inspection tools? I'd think you would dump as much "find references" and "show definition" outputs for relevant variables into context as possible.
faangguyindia · a year ago
Aider also uses ASTs (tree-sitter): it creates a repo map using them and sends it to the LLM.
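Aider's map is built with tree-sitter across many languages; a Python-only analogue using the stdlib ast module gives the flavor of the idea, sending only top-level signatures instead of full file contents (the output format here is illustrative):

```python
import ast

def repo_map(source: str, filename: str) -> str:
    """Summarize one Python file as its top-level function and class
    signatures -- roughly what a repo map hands the LLM in place of
    the full source."""
    tree = ast.parse(source)
    lines = [filename + ":"]
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args})")
    return "\n".join(lines)
```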
grbsh · a year ago
I like this approach a lot, especially because it's not opaque like embeddings. Maybe I can add an option to use this approach instead with repogather, if you are cost sensitive.
jadbox · a year ago
How does Aider (cli) compare to Claude Dev (VSCode plugin)? Anyone have a subjective analysis?
avernet · a year ago
And how does repogather do it? From the README, it looks to me like it might provide the content of each file to the LLM to gauge its relevance. But this would seem prohibitively expensive on anything that isn't a very small codebase (the project I'm working on has on the order of 400k SLOC), even with gpt-4o-mini, wouldn't it?
grbsh · a year ago
repogather indeed as a last step stuffs everything not otherwise excluded through cheap heuristics into gpt-4o-mini to gauge relevance, so it will get expensive for large projects. On my small 8 dev startup's repo, this operation costs 2-4 cents. I was considering adding an `--intelligence` option, where you could trade off different methods between cost, speed, and accuracy. But, I've been very unsatisfied with both embedding search methods, and agentic file search methods. They seem to regularly fail in very unpredictable ways. In contrast, this method works quite well for the projects I tend to work on.

I think in the future as the cost of gpt-4o-mini level intelligence decreases, it will become increasingly worth it, even for larger repositories, to simply attend to every token for certain coding subtasks. I'm assuming here that relevance filtering is a much easier task than coding itself, otherwise you could just copy/paste everything into the final coding model's context. What I think would make much more sense for this project is to optimize the cost / performance of a small LLM fine-tuned for this source relevance task. I suspect I could do much better than gpt-4o-mini, but it would be difficult to deploy this for free.
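The exact cost depends on prompt overhead, but a back-of-envelope estimate under the common ~4-characters-per-token heuristic lines up with the cents-level figures above ($0.15 per million input tokens was gpt-4o-mini's input price at the time):

```python
def estimate_cost(total_chars: int, price_per_mtok: float = 0.15) -> float:
    """Rough input-token cost in dollars: ~4 chars per token, price
    quoted in $ per million input tokens."""
    tokens = total_chars / 4
    return tokens / 1_000_000 * price_per_mtok
```

For example, an 800k-character repo is roughly 200k tokens, or about 3 cents of input at that price.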

punkpeye · a year ago
Continue.dev's approach sounds like it would provide the most relevant code?
morgante · a year ago
Embeddings are actually generally not that effective for code.
