Show HN: Danswer – Open-source question answering across all your docs

I maintain an open source documentation platform, for which I had received a few queries about AI tooling. I'm not into the AI world of development, and my tech stack & distribution approach aren't great to provide AI friendly tech in my project itself, but connecting to external applications that can consume/combine multiple sources seemed like a good potential approach.

I came across Danswer a few days ago as an option for this, so I spent a day building a connector [1]. I was pleasantly surprised how accurate the output was for something like this. I have a few pages detailing my servers and I could ask things like "Where is x server hosted"? and get a correct response accompanied with a link to the right source page.

Some things to be aware of specifically about Danswer: It only works with OpenAI right now, although the team said that open model support is important as a future focus. Additionally it felt fairly heavy to run and required a 30 minute docker build process but I think they've improved on this now with pre-built images, and I'm not familiar with the usual requirements/weight of this kind of tech. Otherwise, things were easy to start up and play around with, even for an AI noob like me. Both their web and text-upload source connectors worked without issue in my testing.

[1]: https://github.com/danswer-ai/danswer/pull/139

gardnr · 3 years ago

There are a couple open source projects that expose llama.cpp and gpt4j models via a compatible OpenAI API. This is one of them: https://github.com/lhenault/simpleAI

sodality2 · 3 years ago

Nowadays falcon-40b is probably more accurate than gpt4j, here's to hoping we get llama.cpp support for falcon builds soon [0]!

[0]: https://github.com/ggerganov/llama.cpp/issues/1602

Sadly completely unusuable for our usecase - if you are targeting Enterprise, you should know better than to use OpenAI models as the only LLM available.

For now I will stick to PrivateGPT and LocalGPT.

jagtstronaut · 3 years ago

Completely unusable for internal docs* should be the caveat. For external doc OpenAI is fine unless you have stuff behind a password.

avereveard · 3 years ago

You may be in a situation where the document is public but the question is confidential, i.e. a user having specific question about an agreement with public TOS that a legal or medical department is managing.

TeMPOraL · 3 years ago

At the very least, I'd start with adding support for the Azure flavor of OpenAI API. It's literally the same models, but the difference is that it's your company deploying those models on Azure, under proper enterprise contract with Microsoft, literally so that they can be safely used with proprietary data.

Weves · 3 years ago

Yea, that's good feedback - we've gotten requests for open source model support from a lot from people we've talked about. It's one of our highest priorities, and should be available soon!

adr1an · 3 years ago

Are you about to use GPT4ALL[^1] or anything else? If you're going with the second option, then please share any link to such resources... I'd be interested.

And, to share with you something: I saw somewhere a tool (maybe it was GPT4ALL itself) that had the ability to expose a OpenAI-compatible local API on localhost:8080... Ah, yes. Here it is. Actually, there are two. They are described as possible backends for Bavarder (that's a free access to multiple online models, API key is not required): https://bavarder.codeberg.page/help/local/

[^1]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-backen...

ssddanbrown · 3 years ago

PeterStuer · 3 years ago

In my experience the QA with documents pattern is fairly straightforward to implement. 90% of the effort to get to a preformant system hoever goes into massaging the documents into semantically meaningfull chuncks. Most business documents, unlike blogposts and news articles, are not just running text. They have a lott of implicit structure that when lost as the typical naive chunckers do, lose much of the contextualized meaning as well.

Agree with the point about intelligent chunking being very important! Each individual app connector can choose how it wants to split each `document` into `section`s (important point: this is customized at an app-level). The default chunker then keeps each section as part of a single chunk as much as possible. The goal here is, as you mentioned, to give each chunk the relevant surrounding context.

Additionally, the indexing process is setup as a composable pipeline under the hood. It would be fairly trivial to plug in different chunkers for different sources as needed in the future.

darkteflon · 3 years ago

Chunking is very important but might, I feel, best be contextualised as one aspect of the bigger substantive challenge, which is how to prevent false negatives at the context retrieval stage - a.k.a. how to ensure your (vector? hybrid?) search returns all relevant context to the LLM’s context window.

Would you mind saying a few words on how Danswer approaches this?

andy99 · 3 years ago

Yes agreed, tooling abounds, the work for anyone who's serious about this is customizing everything so it works with the idiosyncrasies of the documents and questions a customer has. I'm happy to talk to anyone who is interested, we are doing something like this for a company now.

danpetrov · 3 years ago

From FAQ:

> Danswer provides Docker containers that you can run anywhere, so data never has to leave your PVC. The one exception is using GPT for inference but we are working on allowing for locally hosted generative models as well.

Look, you can plug this hole trivially for many companies, by adding support for Azure OpenAI API. It's almost identical to OpenAI API - the main difference is how you pass keys and specify the model to use. But that alone will make it possible to use Danswer with company data in places that signed a relevant contract with Microsoft.

That's a good suggestion! Will look into it, should be fairly easy to support

lmeyerov · 3 years ago

How are you thinking about the "document level access control" to make this viable for business environments?

Ex: If a connected gdrive document gets indexed, but then someone fixes the share settings in google docs for some item to be more restrictive.. How does Danswer avoid leaking that data? Dynamic check before returning any doc that the live federated auth settings safelist the requesting user reading that doc?

Great question! Right now, our access control is very basic. When admins setup connectors to other apps, all documents indexed are accessible by all (meant to be public documents only). Individual users can index private documents by providing their own access tokens for connectors, and those docs will be only available to the user who owns that access token. Improving this is a high priority item for us, as we understand this is a deal-breaker for enterprises.

The immediate plan is to extend our current poll / push based connectors to also grab access information (+ add IdP integrations for cross-app identity). There will be some delay to grab access updates, which will be combatted by the dynamic check with the app / IdP itself at query time that you mentioned (still investigating exactly how this will work).

We are also considering adding support for group based access defined within Danswer itself for sources that don't provide APIs to get access information (default being all-public if not specified). Of course, for these, we will not be able to sync permissions.

ttul · 3 years ago

I wonder how long it will be before Google Workspace just has this feature for your Docs. It can't be long... Question-answering against external docs is something Google could easily add. I worry about the defensibility of startups working in this area as it's so fully in front of the steamroller.

Solvency · 3 years ago

Google can't even make a half rate Bard.

andre-z · 3 years ago

We at Qdrant are glad to be a part of this awesome solution, providing the Vector Database resource for Danswer. https://github.com/qdrant/qdrant

Amazing foundational tools like Qdrant make building in this space so much easier <3

Could you say a few words about why you chose Qdrant for this project? It seems to me there is definitely a place in this space for a back-end focused on retrieval for LLMs that goes beyond simple vector similarity search and encapsulates other metadata creation / indexing and hybrid retrieval techniques to tackle the “false negatives” (missing context) challenge. We’re trying to decide whether leaning on something like Qdrant, Weaviate or Pinecone instead of our current Postgres / pgvector stack might be worth the cost of learning and running extra infra.

TommyCat · 3 years ago

Looks great and will test it out, but for enterprises definitely needs support for Azure/Office 365 integration to index Word, Excel, etc. Lots of docs are stored in Onedrive, Teams channels, and SharePoint. I'm going to test these use cases, but would be nice if it supports it OOB like Google Docs. Also, any thought of OOB connectors to ServiceNow or other ticketing/KB platforms?

Native support for the Microsoft suite of tools is something we plan to add fairly soon! We're a small team, and currently swamped with connector/feature requests so no promises on the timeline.

Ticketing platforms like ServiceNow fall under a similar category, although a bit lower priority in my mind.