We’ve found that most technical searches fall into a few categories: ad-hoc how-tos, understanding an API, recalling forgotten details, research, or troubleshooting. Google is too broad and shallow a tool to be good at these. Even after sifting through the deluge of spammy, SEO-stuffed sites, you still have to dig through discussion boards or documentation to find your answer yourself. Their “featured snippet” approach works for simple factoid queries but quickly falls apart when a question requires reasoning about information across multiple webpages.
Our approach is narrow and deep — to retrieve detailed information for topics relevant to developers. When you submit a query, we pull raw site data from Bing, rerank it, and extract understanding and code snippets with our proprietary large language models. We then use sequence-to-sequence transformer models to generate a final explanation from all of this input.
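In outline, the pipeline is retrieve → rerank → generate. Here is a minimal, hypothetical sketch of that flow — the function names and the toy word-overlap reranker are my illustrations, not Hello's actual models or API:

```python
from typing import Callable

def rerank(query: str, pages: list[str]) -> list[str]:
    """Toy stand-in for the reranking step: order pages by how many
    query words they share (a real system uses a learned relevance model)."""
    terms = set(query.lower().split())
    return sorted(pages,
                  key=lambda p: len(terms & set(p.lower().split())),
                  reverse=True)

def answer_query(query: str,
                 fetch: Callable[[str], list[str]],
                 generate: Callable[[str, str], str],
                 top_k: int = 5) -> str:
    """Retrieve raw pages, rerank them, and hand the best ones to a
    generative model as grounding context."""
    pages = fetch(query)                            # raw result pages (Bing, in Hello's case)
    context = "\n\n".join(rerank(query, pages)[:top_k])
    return generate(query, context)                 # seq-to-seq model writes the explanation
```

The interesting engineering is, of course, inside `rerank` and `generate`; the point here is only the shape of the composition.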
For our honors theses at UT Austin, we researched prototypes of large generative language models that can answer complex questions by combining information from multiple sources. We found that GPT-3, GPT-Neo/J/X, and similar autoregressive language models that predict text from left to right are prone to “hallucinating” — generating text inconsistent with the “ground truth” document. Training a sequence-to-sequence language model (a T5 derivative) on our custom dataset designed for factual generation yielded much better results with less hallucination.
After creating this prototype, we started actively developing Hello with the idea that searching should be just like talking to a smart friend. We want to build an engine that explains complex topics clearly and concisely, and lets users ask follow-up questions using the context of their previous searches.
For example, when asked “what type of semaphore can function as a mutex?”, Hello pulls in the raw text from all five search results linked on the search page to generate: “A binary semaphore can be used as a mutex. Mutexes and semaphores are two different types of synchronization mechanisms. A mutex is a lock that prevents two threads from accessing the same resource at the same time. A semaphore is used to signal that a resource has become available.” We're biased, of course, but we think that the ability to reason abstractly about information from multiple web pages is a cool thing in a search engine!
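The generated claim is easy to verify in code. Here's a minimal Python sketch (my example, not Hello's output) using a binary semaphore — `threading.Semaphore(1)` — as a mutex to protect a shared counter:

```python
import threading

lock = threading.Semaphore(1)   # binary semaphore: count of 1 acts like a mutex
counter = 0

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:              # acquire()/release(): one thread in here at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- without the semaphore, increments could be lost to races
```

The distinction the answer draws also holds: a mutex is conceptually owned by the locking thread, while a semaphore can be released by a different thread to signal availability.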
We use BERT-based models to extract and rank code snippets when they're relevant to the query. Our search engine currently does well at answering concrete how-to questions such as “Sort a list of tuples by the second element”, “Set a response cookie in FastAPI”, “Get value of input in React”, and “How to implement Dijkstra's algorithm.” Exclusively using our own models has also freed us from dependence on OpenAI.
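For reference, the first of those example queries has a one-line answer in Python, which is roughly the kind of snippet a search like this should surface:

```python
from operator import itemgetter

pairs = [("b", 3), ("a", 1), ("c", 2)]

# Sort by the second element of each tuple via a key function.
by_second = sorted(pairs, key=lambda t: t[1])
print(by_second)  # [('a', 1), ('c', 2), ('b', 3)]

# operator.itemgetter is an equivalent, slightly faster alternative.
assert sorted(pairs, key=itemgetter(1)) == by_second
```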
Hello is and will always be free for individual devs. We haven’t rolled out any paid plans yet, but we’re planning to charge teams per user/month to use on internal data scattered around in wikis, documentation, slack, and emails.
We started Hello Cognition to scratch our own itch, but now we hope to improve the state of information retrieval for the greater developer community. If you'd like to be part of our product feedback and iteration process, we'd love to have you—please contact us at founders@sayhello.so.
We're looking forward to hearing your ideas, feedback, comments, and what would be helpful for you when navigating technical problems!
I searched for the following on sayhello.so:
"Service worker fails on request for audio file"
I got back a couple of results related to general service worker use, but none that came close to discussing the core problem that led to the solution.
The same query in Google returned several results that together pointed me to the solution (it turned out to involve range headers in requests for media data types).
This is just one example though. I think the problem you are trying to fix is worth the effort. I just wonder if this is where humans are still stronger than computers - gathering unstructured data to use in problem solving.
Then again maybe that's just me.
My co-founder and I were building the same product as you are some time ago [1]. We managed to scale it to around 5k WAU before we decided to pivot for various reasons.
If you think there might be any useful information and experience we could share with you, please shoot me an email - vasek@usedevbook.com. I'd love to help however I can to see you guys succeed.
[1] https://www.producthunt.com/products/devbook
I've played around just a bit and clicked some of the preset examples and like what I'm seeing so far. I bookmarked it and will try it out more as I code over the next few days.
Main initial feedback: I'd really like to see version/last-updated-at info accompanying all results. One of the biggest problems with Google for code stuff is finding outdated examples and docs. Even better would be a dropdown that lets me see results depending on the version of the language/framework/tools I'm using.
How do you see navigating this space, given that this could be considered a nice-to-have rather than a strict need?
1. In my work (also at UT, actually - Hook 'em), we've found that the hallucination problem is, in part, lessened by over-parametrizing the model. Places that have the budget to do this have noticed that the performance of ml4code transformers increases linearly with every 1e3 increase in the number of parameters (with no drop-off in sight). I'd love to hear your thoughts on this.
2. I'm concerned that finding code snippets from a short-form query underspecifies the problem too much and may not be the best user-interaction model. Let's compare your system to something like GitHub Copilot. I pass a query:
> how to normalize the rows of a tensor pytorch
With GitHub Copilot, I can demonstrate intent in the development environment itself with an I/O example, a comment, or both, and interact more efficiently. If I see errors in the synthesized snippet, I can change the query in a second, and so on. This is hard with a search-engine-style interactive environment. For this query, I had to navigate to the website, type in the query, check the results (which were wrong for me, by the way - y'all might need to check the correctness of the snippets), copy back the result, maybe go to the relevant thread and parse it more closely, and so on. A good question to keep in mind here is how to make this process more interactive.
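For what it's worth, that query has a compact answer. Here's a sketch in NumPy (in PyTorch, `torch.nn.functional.normalize(t, p=2, dim=1)` does the same thing) — my own illustration, not what the engine returned:

```python
import numpy as np

def normalize_rows(t: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize each row to unit length; eps keeps zero rows from
    dividing by zero (they stay zero)."""
    norms = np.linalg.norm(t, axis=1, keepdims=True)
    return t / np.maximum(norms, eps)

t = np.array([[3.0, 4.0],
              [0.0, 2.0]])
print(normalize_rows(t))  # [[0.6 0.8]
                          #  [0.  1. ]]
```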
3. Finally, I just want to say that the website is phenomenal, even on mobile. Kudos on the frontend/backend/architecture side of things.
Also, don't let my or anyone else's comments take away from the awesome work y'all have done!!! I pulled that example from a paper I read recently called TF-Coder. They have a dataset of these examples as part of their supplementary material. All the best!
One feature request at first glance: please default to the system font stack for code snippets. I see you're currently using Consolas, a Microsoft typeface, which is not pleasant to see as a Mac user.
You can use this to default to the system font on every platform:
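For reference, a commonly used cross-platform monospace stack looks like this (the exact list is a matter of taste, and the `code, pre` selectors are just an assumption about the markup; `ui-monospace` resolves to SF Mono on macOS):

```css
code, pre {
  font-family: ui-monospace, SFMono-Regular, Menlo, Monaco,
    Consolas, "Liberation Mono", "Courier New", monospace;
}
```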
Let's say I'm searching for front-end frameworks. Each article has the word "best" in the title, yet doesn't link to resources like the State of JS survey, the Stack Overflow Developer Survey, or other similar sites. So, in this context, "best" is subjective. I can't be bothered with subjective results when I'm trying to find out what is actually considered "best" - or, in this case, popular.
It would be amazing if this could be used for internal documentation, though. We have so much documentation on our wiki, and it's just disorganised.
Also, Stack Overflow's search has always sucked. The way to find stuff on Stack Overflow has mostly been to use Google.