First, I implemented `repogather --all`, which unintelligently copies all source files in your repository to the clipboard (delimited by their relative filepaths). To my surprise, for less complex repositories, this alone is often completely workable for Claude — much better than pasting in just the few files you are looking to update. But I never would have done it if I had to copy/paste everything individually. 200k is quite a lot of tokens!
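For intuition, here is a minimal sketch of what a `--all`-style gather could look like (illustrative only, not repogather's actual implementation; the `pbcopy` pipe assumes macOS):

```python
# Sketch: walk the repo, concatenate every readable file delimited by its
# relative path, and push the result onto the clipboard.
import os
import subprocess

def gather_all(root: str = ".") -> str:
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip VCS internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            chunks.append(f"----- {rel} -----\n{text}")
    return "\n\n".join(chunks)

if __name__ == "__main__":
    subprocess.run(["pbcopy"], input=gather_all().encode("utf-8"), check=True)
```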
But as soon as the repository grows to a certain complexity level (even if it is under the input token limit), I’ve found that Claude can get confused by unrelated parts and concepts scattered across the code. It performs much better if you make an attempt to exclude logic that is irrelevant to your current change. So I implemented `repogather "<query here>"`, e.g. `repogather "only files related to authentication"`. This uses gpt-4o-mini with structured outputs to provide a relevance score for each source file (with automatic exclusions for .gitignore patterns, tests, and configuration, plus manual exclusions with `--exclude <pattern>`).
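To make the mechanism concrete, a per-file relevance call with structured outputs might look roughly like this (the schema, prompt, and field names here are my own assumptions, not repogather's actual code):

```python
# Sketch: ask gpt-4o-mini for a structured relevance score for one file.
from openai import OpenAI
from pydantic import BaseModel

class FileRelevance(BaseModel):
    relevance: float  # 0.0 (unrelated) to 1.0 (directly relevant)
    reason: str

client = OpenAI()

def score_file(query: str, rel_path: str, contents: str) -> FileRelevance:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Score how relevant this source file is to the user's query."},
            {"role": "user",
             "content": f"Query: {query}\n\nFile: {rel_path}\n\n{contents}"},
        ],
        response_format=FileRelevance,
    )
    return completion.choices[0].message.parsed
```

Each file is scored independently, so the calls can be fired off concurrently and the results sorted by score before assembling the final paste.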
gpt-4o-mini is so cheap and fast that, for my ~8-dev startup’s repo, it takes under 5 seconds and costs 3-4 cents (with appropriate exclusions). Plus, you get to watch the output stream while you wait, which always feels fun.
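(Back-of-envelope, assuming gpt-4o-mini's launch pricing of roughly $0.15 per million input tokens: $0.03–0.04 corresponds to something like 200k–270k input tokens, i.e. on the order of the repo's entire source text passing through the scorer once.)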
The retrieval isn’t always perfect the first time — but it is fast, so you can see which files it returned and iterate quickly on your query. I’ve found this much more satisfying than the embedding-search-based solutions I’ve used, which seem to fail in pretty opaque ways.
https://github.com/gr-b/repogather
Let me know if it is useful to you! Always love to talk about how to better integrate LLMs into coding workflows.
On greenfield projects, I ask Claude Sonnet to write all the functions and their signatures, with return values, etc.
Then I have a script which sends these signatures to Gemini Flash, which writes all the functions for me.
All this happens in parallel.
I've found that if you limit the scope, Gemini Flash writes the best code, and it's ultra fast and cheap.
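A sketch of what that fan-out step could look like (the model name, prompt wording, and example signatures here are assumptions, not the commenter's actual script):

```python
# Sketch: send each Sonnet-written signature to Gemini Flash in parallel
# and collect the implementations.
from concurrent.futures import ThreadPoolExecutor
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes the google-generativeai package
model = genai.GenerativeModel("gemini-1.5-flash")

def implement(signature: str) -> str:
    prompt = (
        "Implement this Python function exactly as specified. "
        "Return only the code, no explanation.\n\n" + signature
    )
    return model.generate_content(prompt).text

signatures = [
    'def parse_config(path: str) -> dict:\n    """Load and validate a YAML config file."""',
    'def retry(fn, attempts: int = 3):\n    """Call fn, retrying on exception."""',
]

with ThreadPoolExecutor(max_workers=8) as pool:
    implementations = list(pool.map(implement, signatures))
```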
What if you need to iterate on the functions it gives? Do you just start over with a different prompt, or do you have the ability to do a refinement with Gemini Flash on existing functions?
That's why Gemini Flash might appear dumb next to Sonnet. But who writes those dumb functions better, so that they're guaranteed to keep working in production for a long time? Gemini.
But Sonnet makes silly mistakes: even when I feed it requirements.txt, it still uses methods that either don't exist or used to exist but no longer do.
Gemini Flash isn't as creative.
So basically, we use Sonnet for high-level programming and Flash for low-level work (writing functions that are guaranteed to be correct and clean, with no black magic).
The problem with Sonnet is that it's slow. Sometimes you'll be stuck in a loop where it suggests something, removes it when it encounters errors, then suggests the very same thing you tried before.
I am using Claude Sonnet via Cursor.
>What if you need to iterate on the functions it gives?
I can do it via Aider and even modify the prompt it sends to Gemini Flash.
srtp -> .
OSError: [Errno 62] Too many levels of symbolic links: 'submodules/externals/srtp/include/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp'
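That trace is what you get when a symlink (here `srtp -> .`) points back into its own tree and a file walker follows links. If you do want to follow symlinks while gathering files, one guard is to prune any directory whose resolved path has already been visited; a rough sketch:

```python
# Sketch: walk a tree while following symlinks, but prune directories
# whose real path we've already seen, so 'srtp -> .' can't recurse forever.
import os

def walk_no_loops(root: str):
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        real = os.path.realpath(dirpath)
        if real in seen:
            dirnames[:] = []  # already visited via another link: stop descending
            continue
        seen.add(real)
        for name in filenames:
            yield os.path.join(dirpath, name)
```

(The simpler fix is just leaving `followlinks` at its default of False.)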
Example: "Here is my whole project, now implement user authentication with plain username/password?"*
* Until the repository gets more complicated, which is why we need the intelligent relevance filtering features of repogather, e.g. `repogather "Only files related to authentication and avatar uploads"`
It doesn’t have all the fancy LLM integration though.
If your codebase is structured in a very modular way, then this one-liner mostly just works:
find . -type f -exec echo {} \; -exec cat {} \; | pbcopy
Would it be okay if I include this one liner in the readme (with credit) as an alternative?
Part of it is that I actually get better results using repogather + Claude UI for asking questions about my code than I get with Cursor’s chat. I suspect the index it creates on my codebase just isn’t very good, and it’s opaque to me.
I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.
> I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.
I've been extremely interested in this question! Will be interesting to see how things develop, but I suspect that relevance filtering is not as difficult as coding, so small, cheap LLMs will make the former a solved, inexpensive problem, while we will continue to build larger and more expensive LLMs to solve the latter.
That said, you can buy a lot of tokens for $150k, so this could be short sighted.
I think in the future, as the cost of gpt-4o-mini-level intelligence decreases, it will become increasingly worth it, even for larger repositories, to simply attend to every token for certain coding subtasks. I'm assuming here that relevance filtering is a much easier task than coding itself; otherwise you could just copy/paste everything into the final coding model's context. What I think would make much more sense for this project is to optimize the cost/performance of a small LLM fine-tuned for this source-relevance task. I suspect I could do much better than gpt-4o-mini, but it would be difficult to deploy this for free.