They list things like "Fast" and "Regular Expressions", but what about correct results? I often use GitHub web search, and it fails to find terms I know for a fact exist. After it returns nothing, I manually navigate to the file containing the search term, and sure enough, it's there.
Sorry you've had a frustrating experience with the product. When we look into complaints about missing results, it's almost always one of two things. First, the repository is not indexed yet, but searching triggers indexing. Future searches work. Second, the files or the repo hit our documented limitations (https://docs.github.com/en/search-github/github-code-search/...). I think these limits are pretty reasonable, but we need to do a better job making it clear why content isn't in the index.
I understand this is important. But the issue I have is that it’s hard or maybe impossible to know what’s been indexed and what isn’t.
I run a few orgs with hundreds of repos. Which are indexed? I don’t know.
This makes your search suck for my organization. I understand the reasons. They aren’t reasonable for me. I don’t want to search using your tool if it won’t work for my org.
Code search isn’t just for what’s popular. It needs to be for what is real and accurate.
When a user's search includes a non-indexed repository, GitHub needs to include a warning alongside the search results, something similar to what you just mentioned:
"One or more repositories in this search is not yet indexed; please retry your search later for accurate results."
The new UX does a really bad job of helping you realize when you're logged out and search results are being suppressed. It very much implies "zero results" when in fact it means to say "zero results, except in code, where there might be results, but you need to log in first to find out".
Weird, I've never found that to be the case. There's a "flash" sort of message right at the top of the search results that says "Sign in to see results in the `x` org"
Nah, it's always been unreliable. Around 2016 a coworker wrote a code indexer for our private repos and we used that to find occurrences of things which were critical.
We've been burnt sooo many times ('are you sure we changed all the occurrences of this in all our 3252 microservices?').
Between shoddy UX (the + being part of the code you were copying!), blatantly missing features for years (ZenHub, late GitHub actions), GitHub really succeeded only because of timing and network effect. The developers' community is quite powerful.
I'm happy for the original team, but now it's just another MS acquisition I hope we'll stop using.
I don't know if it's related to the changes to code search (I use GitHub exclusively as a guest), but in the last year GitHub has gotten so slow and heavy that even scrolling a page showing a 200-line file is very painful, and one with more than 2000 lines crashes the tab. I don't have the most powerful machines at my disposal, but what I have should be more than powerful enough to browse a repository. My intuition points at the syntax highlighter and the file browser/go-to-symbol as possible causes. Does anybody know of some magic setting I can use to get a better experience? Right now I'm forced to avoid it as much as possible and clone repositories to browse them locally.
The magic setting seems to be don't use Github, unfortunately.
There were a couple GH discussion threads about the dreadful UI rollout, but someone complained one too many times about (code) search returning the wrong results so they've since been locked. For me it's the issue search that seems to consistently return the wrong results, and the text input widgets that are glitchy as hell in Firefox.
Performance-wise it's worth keeping in mind that Github is now rendering all the text client side. If you go through the discussion threads you can see where the GH devs didn't know how to account for multibyte codepoints and how that broke their non-native rendering with things like emoji and Chinese characters.
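The codepoint pitfall is easy to reproduce. Here is a sketch (in Python for brevity, though the client-side bug would live in JavaScript) of why code that confuses bytes, UTF-16 units, and code points breaks on emoji and Chinese text; this illustrates the general problem, not GitHub's actual code:

```python
# A renderer or highlighter that mixes up these three units will misplace
# highlights and selections on non-ASCII text. Illustrative only.

def offset_units(s: str) -> dict:
    """Return the length of s in three common units."""
    return {
        "code_points": len(s),                           # what Python's len() counts
        "utf8_bytes": len(s.encode("utf-8")),            # what a byte-offset index counts
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # what JavaScript's .length counts
    }

# ASCII: all three units agree, so sloppy code "works" until non-ASCII shows up.
assert offset_units("search") == {"code_points": 6, "utf8_bytes": 6, "utf16_units": 6}

# An emoji is 1 code point but 4 UTF-8 bytes and 2 UTF-16 units (a surrogate pair).
assert offset_units("🙂") == {"code_points": 1, "utf8_bytes": 4, "utf16_units": 2}

# CJK text: 4 code points, 12 UTF-8 bytes, 4 UTF-16 units.
assert offset_units("代码搜索") == {"code_points": 4, "utf8_bytes": 12, "utf16_units": 4}
```

Any offset computed in one unit and applied in another silently drifts as soon as such characters appear.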
Oh yeah, they use react now, and have to use smart tricks so you can actually select/search highlighted code.
The code is also probably added as you scroll too, using a virtual list.
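For the curious, the core of a virtual list is just window arithmetic: render only the rows that intersect the viewport, plus a little overscan. A minimal sketch with a hypothetical `visible_rows` helper and made-up numbers; this is not GitHub's implementation:

```python
# Only the rows intersecting the viewport (plus overscan) are mounted in the
# DOM; everything else is represented by empty spacer height.

def visible_rows(scroll_top: int, viewport_height: int, row_height: int,
                 total_rows: int, overscan: int = 3) -> range:
    """Return the range of row indices that should actually be rendered."""
    first = max(0, scroll_top // row_height - overscan)
    last = min(total_rows, (scroll_top + viewport_height) // row_height + 1 + overscan)
    return range(first, last)

# Scrolled 2000px into a 5000-line file at 20px per line, with a 600px viewport:
rows = visible_rows(scroll_top=2000, viewport_height=600, row_height=20, total_rows=5000)
assert rows == range(97, 134)   # only ~37 of 5000 lines are actually in the DOM
```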
Stale data, PRs and such don't update properly, page loads are glacial, and half the time it's showing a loading bar of its own, no doubt as part of some trying-to-be-clever SPA thing. It's awful.
In the actions overview, just the spinner icon (svg w/ css animation) takes 25-40% cpu and 10% gpu in my chrome. Enough to keep my fingers very warm on the laptop
I’ve noticed the code reader is worthless on even slightly out of date browsers now, and even on newer browsers it tends to choke and stutter on large files. Sad :( it used to be the best
GitHub’s previous search was not great, and when the new version launched it was a massive leap forward, where it’s now part of my daily workflow. Before this I thought good search at this scale might just be an intractable problem. :)
Meh, before the Microsoft acquisition you could get an API key for any service you wanted just by running a search on GitHub. Not sure how many people knew about it; it was probably a dirty secret, but I used to crawl tons of stuff by rotating API keys found on GitHub. None of that is possible anymore.
On the plus side, I've lost count of how many reports I've made to companies that leaked not only their username/password but also all the proxies you could use to get inside their network. The weirdest one was a guy working in security at Thales, which is supposed to handle security-sensitive work for governments, leaking all that information because he was working on a poker side project during business hours...
Yes! My colleague who created it has started working on an open source version so we can publish it. I am not sure when it will be ready, but I'm excited because it is extremely interesting and has a lot of potential use cases.
Since we're talking about GH code search: a frequent issue raised on community forums is that some repos stopped being indexed and yield 0 results. (This also happened to repos at my bigcorp.)
Has there been a systemic fix for that issue, other than asking GH support teams to reindex the repo?
I'm currently building a query language whose grammar is very much inspired by GitHub's search syntax. I'm using Lezer, a GLR parser generator similar to Tree-sitter, so this talk taught me a few things about parser generators (I have no formal CS education). Here's my grammar, a playground, and an example search query if anyone wants to play with it:
https://github.com/AlexErrant/Pentive/blob/main/app/src/quer...
https://littletools.app/lezer
-(a) spider-man -a b -c -"(quote\"d) str" OR "l o l" OR a b c ((a "c") b) tag:what -deck:"x y"
I just converted the syntax tree to an abstract syntax tree so I can do De Morgan's law transformations. Literally the only time in my professional life where I feel like I'm solving a leetcode-style problem.
(I'm the speaker.) This is frequently requested, and we've tried to answer it in GitHub's feedback forums. The reason it's not implemented is that it's quite complicated, and there are other things we'd rather work on (sorting by recency was mostly used by scrapers, so it's not high yield for us, unfortunately). It's complicated due to some consequences of things I discuss in the talk.
First of all, the old system didn't really implement sorting for "recently changed" files. It implemented sorting for "recently indexed" documents. This was a decent proxy when we only rebuilt the index every two years. But as I discuss in the talk, we rebuild the index all the time now, so it would be pretty worthless as a sorting option. Another reason is due to the content deduplication I also discussed in the talk. When a blob is shared between repositories, what time do you store? Finally, it's complicated because of how Git works. If we did implement this, we'd like it to reflect when the path changed in the branch, which is what people mean by "when did this file last change". But Git doesn't have an efficient way to retrieve that information, so we'd have to build a system to keep track of it.
In short, it's hard to do right and easy to abuse. :-(
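To make the cost concrete: the naive approach is a single pass over the entire history (e.g. the output of `git log --format=%ct --name-only`), recording the newest commit timestamp per path. The parsing sketch below is hypothetical and the paths and timestamps are made up, but it shows why this is O(history) and therefore unsuitable for on-the-fly sorting:

```python
# Parse captured `git log --format=%ct --name-only` output (newest commit
# first): each commit prints its Unix timestamp, then the paths it touched.
# The first timestamp seen for a path is its most recent change.

def last_change_times(log_output: str) -> dict[str, int]:
    """Map each path to the Unix timestamp of the newest commit touching it."""
    times: dict[str, int] = {}
    current_ts = None
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():               # a %ct timestamp line starts a new commit
            current_ts = int(line)
        elif current_ts is not None:     # a path changed by the current commit
            times.setdefault(line, current_ts)   # keep the first (newest) timestamp
    return times

sample = """\
1700000000
src/main.rs
README.md

1600000000
src/main.rs
src/lib.rs
"""
assert last_change_times(sample) == {
    "src/main.rs": 1700000000,
    "README.md": 1700000000,
    "src/lib.rs": 1600000000,
}
```

Every query that wants this ordering would need the equivalent of this full walk per repository, which is exactly the "system to keep track of it" the comment describes.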
I found sorting by recency to be helpful to see what mistakes users would make. There is not much feedback in open-source until after an api is released, at which point it is usually too late to fix problems. Seeing what users did, what frustrated them, and providing feedback helped make a better library. That's typically what internal teams get to do on a large corpus, like google3, and github is equally excellent for these insights.
An alert on recently indexed content that matches keyword subscriptions, à la Google Alerts, would be an excellent alternative for that use-case.
> not implemented is it's quite complicated and there are other things we'd rather work on
I get this, but why break things if you don’t want to fix them? That’s great that you want to work on other things, but that feature was useful and existed for years and people depended on it. I pay for GitHub and you’re taking away features.
Not you personally, but this attitude is frustrating for me.
As a user, I don’t agree that the new features you implemented are better than the ones you took away.
It’s your choice, of course, but I don’t like this shift in dev mindset where really basic features that have been around since Unix time and are essential to programmers aren’t implemented because they are too hard.
Thanks for engaging in this thread and glad you’re working on this. But hoping since you’re involved in the development that you might be able to shift things a little toward “the good way.”
One of the (among many) reasons I stopped using GH code search is that the default ranking algorithm is extremely poor. Search relevance is awful, especially when it comes to surfacing forks.
90+% of the time I'm executing a code search, I'm looking for example uses of a library API in open source code. But most of the time, GitHub code search just surfaces pages upon pages of the same exact code snippet across the origin repository and hundreds of forks. For example, one search I executed earlier today: https://github.com/search?q=load%28%22%40rules_foreign_cc%2F... Compare the same query on grep.app: https://grep.app/search?q=load%28%22%40rules_foreign_cc//for...
I work on this, and I agree we should do more to improve ranking. Exact duplicates are suppressed, but forks often have different versions of some files, so they come up in the results.
If you don't want to see forks, you can exclude them. Here's your same search, converted to a regex and excluding forks (I only get 3 results): https://github.com/search?q=%2Fload%5C%28%22%40rules_foreign...
Likewise for the second case: a warning should be shown when results are missing because content hit the documented limitations.
Personally, my Firefox Nightly on Android will regularly crash (along with the whole System UI) when opening / trying to type into a GitHub page
There's a hanging draft for it in Gitea (inb4 yes also Forgejo).
https://github.com/go-gitea/gitea/pull/20311
Maybe this is the push I need to actually start using taskwarrior...
https://github.com/GothenburgBitFactory/bugwarrior
I know we don’t typically say this here but, you have a very relevant username for your question xoranth :D
Could you please give an update on whether or not GitHub is still considering adding “sort by recent” to search?
——
E: I just saw you answered that already. It’s a dearly missed feature.
https://www.videogist.co/videos/lessons-from-building-github...
• GitHub's previous code search was slow, limited, and did not support searching Forks due to indexing challenges. A new system called Blackbird was built from scratch to address these issues.
• Indexing code poses unique challenges compared to natural language documents, such as handling file changes in version control systems and deduplicating shared code across repositories.
• The talk discussed techniques used in Blackbird like trigram tokenization, delta compression, caching, and dynamic shard assignment to improve indexing speed and efficiency at scale.
• Architectural decisions like separating indexing from querying and using message queues helped Blackbird scale independently without competing for resources.
• Data structures like geometric XOR filters were developed to efficiently estimate differences between codebases and enable features like delta compression.
• Iteration speed was improved by making the system easier to change through frequent index version increments without migrations.
• Resource usage was optimized through techniques such as document deduplication, caching, and compaction to reduce indexing costs.
• Blackbird's design allowed it to efficiently support over 100 million code repositories while the previous system struggled at millions.
• Building custom solutions from scratch can be worthwhile when domain-specific data structures can outperform generic tools.
• Anticipating and addressing scaling challenges at each magnitude is important to ensure a system remains performant as it grows over time.
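The trigram technique in the bullets above can be sketched in a few lines: index every overlapping 3-gram of each document, then intersect posting lists to narrow a query to candidate documents. This is a toy illustration, nothing like Blackbird's real index:

```python
# Toy trigram inverted index: map each 3-gram to the set of documents
# containing it. A query's candidates are the docs containing ALL of its
# trigrams (a superset of true matches, which a verification pass would trim).
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, content in docs.items():
        for gram in trigrams(content):
            index[gram].add(doc_id)
    return index

def candidates(index: dict[str, set[str]], query: str) -> set[str]:
    """Docs containing every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        return set()   # query too short to narrow anything down
    return set.intersection(*(index.get(g, set()) for g in grams))

docs = {
    "a.rs": "fn parse_query(input: &str)",
    "b.rs": "fn parse_config(path: &Path)",
    "c.py": "def render(template):",
}
index = build_index(docs)
assert candidates(index, "parse_query") == {"a.rs"}
assert candidates(index, "parse_") == {"a.rs", "b.rs"}   # shared prefix matches both
```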
What a poor society we're part of if everything needs to be consumed in 5-minute chunks.
One should also think and reflect about the content being presented. Grasp the ideas.
It is also about honoring the time the speaker put into the presentation preparing it.