They list things like "Fast" and "Regular Expressions", but what about correct results? I often use GitHub web search, and it fails to find terms I know for a fact exist. After it returns nothing, I manually navigate to the file containing the search term, and sure enough, it's there.
Sorry you've had a frustrating experience with the product. When we look into complaints about missing results, it's almost always one of two things. First, the repository is not indexed yet, but searching triggers indexing. Future searches work. Second, the files or the repo hit our documented limitations (https://docs.github.com/en/search-github/github-code-search/...). I think these limits are pretty reasonable, but we need to do a better job making it clear why content isn't in the index.
I understand this is important. But the issue I have is that it’s hard or maybe impossible to know what’s been indexed and what isn’t.
I run a few orgs with hundreds of repos. Which are indexed? I don’t know.
This makes your search suck for my organization. I understand the reasons. They aren’t reasonable for me. I don’t want to search using your tool if it won’t work for my org.
Code search isn’t just for what’s popular. It needs to be for what is real and accurate.
When a user's search includes a non-indexed repository, GitHub needs to include a warning alongside the search results, something similar to what you just mentioned:
"One or more repositories in this search is not yet indexed; please retry your search later for accurate results."
The new UX does a really bad job of helping you realize when you're logged out and search results are being suppressed. It very much implies "zero results" when in fact it means to say "zero results, except in code, where there might be results, but you need to log in first to find out".
Weird, I've never found that to be the case. There's a "flash" sort of message right at the top of the search results that says "Sign in to see results in the `x` org"
Nah, it's always been unreliable. Around 2016 a coworker wrote a code indexer for our private repos and we used that to find occurrences of things which were critical.
We've been burnt sooo many times ('are you sure we changed all the occurrences of this in all our 3252 microservices?').
Between shoddy UX (the + being part of the code you were copying!), blatantly missing features for years (ZenHub, late GitHub actions), GitHub really succeeded only because of timing and network effect. The developers' community is quite powerful.
I'm happy for the original team, but now it's just another MS acquisition I hope we'll stop using.
I don't know if it's related to the changes to code search (I use GitHub exclusively as a guest), but in the last year GitHub has gotten so slow and heavy that even scrolling a page showing a 200-line file is very painful, and one with more than 2000 lines crashes the tab. I don't have the most powerful machines at my disposal, but what I have should be more than powerful enough to browse a repository. My intuition points at the syntax highlighter and the file browser/go-to-symbol as possible causes. Does anybody know of some magic setting I can use to get a better experience? Right now I'm forced to avoid it as much as possible and clone repositories to browse them locally.
The magic setting seems to be don't use Github, unfortunately.
There were a couple GH discussion threads about the dreadful UI rollout, but someone complained one too many times about (code) search returning the wrong results so they've since been locked. For me it's the issue search that seems to consistently return the wrong results, and the text input widgets that are glitchy as hell in Firefox.
Performance-wise it's worth keeping in mind that Github is now rendering all the text client side. If you go through the discussion threads you can see where the GH devs didn't know how to account for multibyte codepoints and how that broke their non-native rendering with things like emoji and Chinese characters.
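The codepoint pitfall is easy to reproduce. Here is a sketch (in Python for brevity, though the client-side bug would live in JavaScript) of why code that confuses bytes, UTF-16 units, and code points breaks on emoji and Chinese text; this illustrates the general problem, not GitHub's actual code:

```python
# A renderer or highlighter that mixes up these three units will misplace
# highlights and selections on non-ASCII text. Illustrative only.

def offset_units(s: str) -> dict:
    """Return the length of s in three common units."""
    return {
        "code_points": len(s),                           # what Python's len() counts
        "utf8_bytes": len(s.encode("utf-8")),            # what a byte-offset index counts
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # what JavaScript's .length counts
    }

# ASCII: all three units agree, so sloppy code "works" until non-ASCII shows up.
assert offset_units("search") == {"code_points": 6, "utf8_bytes": 6, "utf16_units": 6}

# An emoji is 1 code point but 4 UTF-8 bytes and 2 UTF-16 units (a surrogate pair).
assert offset_units("🙂") == {"code_points": 1, "utf8_bytes": 4, "utf16_units": 2}

# CJK text: 4 code points, 12 UTF-8 bytes, 4 UTF-16 units.
assert offset_units("代码搜索") == {"code_points": 4, "utf8_bytes": 12, "utf16_units": 4}
```

Any offset computed in one unit and applied in another silently drifts as soon as such characters appear.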
Oh yeah, they use react now, and have to use smart tricks so you can actually select/search highlighted code.
The code is also probably added as you scroll too, using a virtual list.
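For the curious, the core of a virtual list is just window arithmetic: render only the rows that intersect the viewport, plus a little overscan. A minimal sketch with a hypothetical `visible_rows` helper and made-up numbers; this is not GitHub's implementation:

```python
# Only the rows intersecting the viewport (plus overscan) are mounted in the
# DOM; everything else is represented by empty spacer height.

def visible_rows(scroll_top: int, viewport_height: int, row_height: int,
                 total_rows: int, overscan: int = 3) -> range:
    """Return the range of row indices that should actually be rendered."""
    first = max(0, scroll_top // row_height - overscan)
    last = min(total_rows, (scroll_top + viewport_height) // row_height + 1 + overscan)
    return range(first, last)

# Scrolled 2000px into a 5000-line file at 20px per line, with a 600px viewport:
rows = visible_rows(scroll_top=2000, viewport_height=600, row_height=20, total_rows=5000)
assert rows == range(97, 134)   # only ~37 of 5000 lines are actually in the DOM
```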
Stale data, PRs and such don't update properly, page loads are glacial, and half the time it's showing a loading bar of its own, no doubt as part of some trying-to-be-clever SPA thing. It's awful.
In the actions overview, just the spinner icon (svg w/ css animation) takes 25-40% cpu and 10% gpu in my chrome. Enough to keep my fingers very warm on the laptop
I’ve noticed the code reader is worthless on even slightly out of date browsers now, and even on newer browsers it tends to choke and stutter on large files. Sad :( it used to be the best
GitHub’s previous search was not great, and when the new version launched it was a massive leap forward, where it’s now part of my daily workflow. Before this I thought good search at this scale might just be an intractable problem. :)
Meh, before the Microsoft acquisition you could get an API key for any service you wanted just by running a search on GitHub. Not sure how many people knew about it; it was probably a dirty secret, but I used to crawl tons of stuff by rotating API keys found on GitHub. None of that is possible anymore.
On the plus side, I've lost count of how many reports I've made to companies that leaked not only their username/password but also all the proxies you could use to get inside their network. The weirdest one was a guy working in security at Thales, which is supposed to handle security-sensitive work for governments, leaking all that information because he was working on a poker side project during business hours...
Yes! My colleague who created it has started working on an open source version so we can publish it. I am not sure when it will be ready, but I'm excited because it is extremely interesting and has a lot of potential use cases.
Since we're talking about GH code search: a frequent issue raised on community forums is that some repos stopped being indexed and yield 0 results. (This also happened to repos at my bigcorp.)
Has there been a systemic fix for that issue, other than asking GH support teams to reindex the repo?
I'm currently building a query language whose grammar is very much inspired by GitHub's search syntax. I'm using Lezer, a GLR parser generator similar to Tree-sitter, so this talk taught me a few things about parser generators (I have no formal CS education). Here's my grammar, a playground, and an example search query if anyone wants to play with it:
https://github.com/AlexErrant/Pentive/blob/main/app/src/quer...
https://littletools.app/lezer
-(a) spider-man -a b -c -"(quote\"d) str" OR "l o l" OR a b c ((a "c") b) tag:what -deck:"x y"
I just converted the syntax tree to an abstract syntax tree so I can do De Morgan's law transformations. Literally the only time in my professional life where I feel like I'm solving a leetcode-style problem.
(I'm the speaker.) This is frequently requested, and we've tried to answer it in GitHub's feedback forums. The reason it's not implemented is that it's quite complicated, and there are other things we'd rather work on (sorting by recency was mostly used by scrapers, so it's not high yield for us, unfortunately). It's complicated due to some consequences of things I discuss in the talk.
First of all, the old system didn't really implement sorting for "recently changed" files. It implemented sorting for "recently indexed" documents. This was a decent proxy when we only rebuilt the index every two years. But as I discuss in the talk, we rebuild the index all the time now, so it would be pretty worthless as a sorting option. Another reason is due to the content deduplication I also discussed in the talk. When a blob is shared between repositories, what time do you store? Finally, it's complicated because of how Git works. If we did implement this, we'd like it to reflect when the path changed in the branch, which is what people mean by "when did this file last change". But Git doesn't have an efficient way to retrieve that information, so we'd have to build a system to keep track of it.
In short, it's hard to do right and easy to abuse. :-(
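To make the cost concrete: the naive approach is a single pass over the entire history (e.g. the output of `git log --format=%ct --name-only`), recording the newest commit timestamp per path. The parsing sketch below is hypothetical and the paths and timestamps are made up, but it shows why this is O(history) and therefore unsuitable for on-the-fly sorting:

```python
# Parse captured `git log --format=%ct --name-only` output (newest commit
# first): each commit prints its Unix timestamp, then the paths it touched.
# The first timestamp seen for a path is its most recent change.

def last_change_times(log_output: str) -> dict[str, int]:
    """Map each path to the Unix timestamp of the newest commit touching it."""
    times: dict[str, int] = {}
    current_ts = None
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():               # a %ct timestamp line starts a new commit
            current_ts = int(line)
        elif current_ts is not None:     # a path changed by the current commit
            times.setdefault(line, current_ts)   # keep the first (newest) timestamp
    return times

sample = """\
1700000000
src/main.rs
README.md

1600000000
src/main.rs
src/lib.rs
"""
assert last_change_times(sample) == {
    "src/main.rs": 1700000000,
    "README.md": 1700000000,
    "src/lib.rs": 1600000000,
}
```

Every query that wants this ordering would need the equivalent of this full walk per repository, which is exactly the "system to keep track of it" the comment describes.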
I found sorting by recency to be helpful to see what mistakes users would make. There is not much feedback in open-source until after an api is released, at which point it is usually too late to fix problems. Seeing what users did, what frustrated them, and providing feedback helped make a better library. That's typically what internal teams get to do on a large corpus, like google3, and github is equally excellent for these insights.
An alert on recently indexed content that matches keyword subscriptions, à la Google Alerts, would be an excellent alternative for that use-case.
> not implemented is it's quite complicated and there are other things we'd rather work on
I get this, but why break things if you don’t want to fix them? That’s great that you want to work on other things, but that feature was useful and existed for years and people depended on it. I pay for GitHub and you’re taking away features.
Not you personally, but this attitude is frustrating for me.
As a user, I don’t agree that the new features you implemented are better than the ones you took away.
It’s your choice, of course, but I don’t like this shift in dev mindset where really basic features that have been around since Unix time and are essential to programmers aren’t implemented because they are too hard.
Thanks for engaging in this thread and glad you’re working on this. But hoping since you’re involved in the development that you might be able to shift things a little toward “the good way.”
One of the (among many) reasons I stopped using GH code search is that the default ranking algorithm is extremely poor. Search relevance is awful, especially when it comes to surfacing forks.
90+% of the time I'm executing a code search, I'm looking for example uses of a library API in open source code. But most of the time, GitHub code search just surfaces pages upon pages of the same exact code snippet across the origin repository and hundreds of forks. For example, one search I executed earlier today: https://github.com/search?q=load%28%22%40rules_foreign_cc%2F... Compare the same query on grep.app: https://grep.app/search?q=load%28%22%40rules_foreign_cc//for...
I work on this, and I agree we should do more to improve ranking. Exact duplicates are suppressed, but forks often have different versions of some files, so they come up in the results.
If you don't want to see forks, you can exclude them. Here's your same search, converted to a regex and excluding forks (I only get 3 results): https://github.com/search?q=%2Fload%5C%28%22%40rules_foreign...
Likewise for the second case: a warning should be shown when results are missing because content hit the documented limitations.
Personally, my Firefox Nightly on Android will regularly crash (along with the whole System UI) when opening / trying to type into a GitHub page
There's a hanging draft for it in Gitea (inb4 yes also Forgejo).
https://github.com/go-gitea/gitea/pull/20311
Maybe this is the push I need to actually start using taskwarrior...
https://github.com/GothenburgBitFactory/bugwarrior
I know we don’t typically say this here but, you have a very relevant username for your question xoranth :D
Could you please give an update on whether or not GitHub is still considering adding “sort by recent” to search?
——
E: I just saw you answered that already. It’s a dearly missed feature.
https://www.videogist.co/videos/lessons-from-building-github...
• GitHub's previous code search was slow, limited, and did not support searching Forks due to indexing challenges. A new system called Blackbird was built from scratch to address these issues.
• Indexing code poses unique challenges compared to natural language documents, such as handling file changes in version control systems and deduplicating shared code across repositories.
• The talk discussed techniques used in Blackbird like trigram tokenization, delta compression, caching, and dynamic shard assignment to improve indexing speed and efficiency at scale.
• Architectural decisions like separating indexing from querying and using message queues helped Blackbird scale independently without competing for resources.
• Data structures like geometric XOR filters were developed to efficiently estimate differences between codebases and enable features like delta compression.
• Iteration speed was improved by making the system easier to change through frequent index version increments without migrations.
• Resource usage was optimized through techniques such as document deduplication, caching, and compaction to reduce indexing costs.
• Blackbird's design allowed it to efficiently support over 100 million code repositories while the previous system struggled at millions.
• Building custom solutions from scratch can be worthwhile when domain-specific data structures can outperform generic tools.
• Anticipating and addressing scaling challenges at each magnitude is important to ensure a system remains performant as it grows over time.
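The trigram technique in the bullets above can be sketched in a few lines: index every overlapping 3-gram of each document, then intersect posting lists to narrow a query to candidate documents. This is a toy illustration, nothing like Blackbird's real index:

```python
# Toy trigram inverted index: map each 3-gram to the set of documents
# containing it. A query's candidates are the docs containing ALL of its
# trigrams (a superset of true matches, which a verification pass would trim).
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, content in docs.items():
        for gram in trigrams(content):
            index[gram].add(doc_id)
    return index

def candidates(index: dict[str, set[str]], query: str) -> set[str]:
    """Docs containing every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        return set()   # query too short to narrow anything down
    return set.intersection(*(index.get(g, set()) for g in grams))

docs = {
    "a.rs": "fn parse_query(input: &str)",
    "b.rs": "fn parse_config(path: &Path)",
    "c.py": "def render(template):",
}
index = build_index(docs)
assert candidates(index, "parse_query") == {"a.rs"}
assert candidates(index, "parse_") == {"a.rs", "b.rs"}   # shared prefix matches both
```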
What a poor society we're part of if everything needs to be consumed in 5-minute chunks.
One should also think and reflect about the content being presented. Grasp the ideas.
It is also about honoring the time the speaker put into the presentation preparing it.