Readit News logoReadit News
Posted by u/Weves 3 years ago
Show HN: Danswer – Open-source question answering across all your docsgithub.com/danswer-ai/dan...
My friend and I have been feeling frustrated at how inefficient it is to find information at work. There are so many tools (Slack, Confluence, GitHub, Jira, Google Drive, etc.) and they provide different (often not great) ways to find information. We thought maybe LLMs could help, so over the last couple months we've been spending a bit of time on the side to build Danswer.

It is an open source, self-hosted search tool that allows you to ask questions and get answers across common workspace apps AND your personal documents (via file upload / web scraping)! Full demo here: https://www.youtube.com/watch?v=geNzY1nbCnU&t=2s.

The code (https://github.com/danswer-ai/danswer) is open source and permissively licensed (MIT). If you want to try it out, you can set it up locally with just a couple of commands (more details in our docs - https://docs.danswer.dev/introduction). We hope that someone out there finds this useful

We’d love to hear from you in our Slack (https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-...) or Discord (https://discord.gg/TDJ59cGV2X). Let us know what other features would be useful for you!

ssddanbrown · 3 years ago
I maintain an open source documentation platform, for which I had received a few queries about AI tooling. I'm not into the AI world of development, and my tech stack & distribution approach aren't great to provide AI friendly tech in my project itself, but connecting to external applications that can consume/combine multiple sources seemed like a good potential approach.

I came across Danswer a few days ago as an option for this, so I spent a day building a connector [1]. I was pleasantly surprised how accurate the output was for something like this. I have a few pages detailing my servers and I could ask things like "Where is x server hosted"? and get a correct response accompanied with a link to the right source page.

Some things to be aware of specifically about Danswer: It only works with OpenAI right now, although the team said that open model support is important as a future focus. Additionally it felt fairly heavy to run and required a 30 minute docker build process but I think they've improved on this now with pre-built images, and I'm not familiar with the usual requirements/weight of this kind of tech. Otherwise, things were easy to start up and play around with, even for an AI noob like me. Both their web and text-upload source connectors worked without issue in my testing.

[1]: https://github.com/danswer-ai/danswer/pull/139

gardnr · 3 years ago
There are a couple open source projects that expose llama.cpp and gpt4j models via a compatible OpenAI API. This is one of them: https://github.com/lhenault/simpleAI
sodality2 · 3 years ago
Nowadays falcon-40b is probably more accurate than gpt4j, here's to hoping we get llama.cpp support for falcon builds soon [0]!

[0]: https://github.com/ggerganov/llama.cpp/issues/1602

PeterStuer · 3 years ago
In my experience the QA with documents pattern is fairly straightforward to implement. 90% of the effort to get to a preformant system hoever goes into massaging the documents into semantically meaningfull chuncks. Most business documents, unlike blogposts and news articles, are not just running text. They have a lott of implicit structure that when lost as the typical naive chunckers do, lose much of the contextualized meaning as well.
Weves · 3 years ago
Agree with the point about intelligent chunking being very important! Each individual app connector can choose how it wants to split each `document` into `section`s (important point: this is customized at an app-level). The default chunker then keeps each section as part of a single chunk as much as possible. The goal here is, as you mentioned, to give each chunk the relevant surrounding context.

Additionally, the indexing process is setup as a composable pipeline under the hood. It would be fairly trivial to plug in different chunkers for different sources as needed in the future.

darkteflon · 3 years ago
Chunking is very important but might, I feel, best be contextualised as one aspect of the bigger substantive challenge, which is how to prevent false negatives at the context retrieval stage - a.k.a. how to ensure your (vector? hybrid?) search returns all relevant context to the LLM’s context window.

Would you mind saying a few words on how Danswer approaches this?

andy99 · 3 years ago
Yes agreed, tooling abounds, the work for anyone who's serious about this is customizing everything so it works with the idiosyncrasies of the documents and questions a customer has. I'm happy to talk to anyone who is interested, we are doing something like this for a company now.
danpetrov · 3 years ago
Sadly completely unusuable for our usecase - if you are targeting Enterprise, you should know better than to use OpenAI models as the only LLM available.

For now I will stick to PrivateGPT and LocalGPT.

jagtstronaut · 3 years ago
Completely unusable for internal docs* should be the caveat. For external doc OpenAI is fine unless you have stuff behind a password.
avereveard · 3 years ago
You may be in a situation where the document is public but the question is confidential, i.e. a user having specific question about an agreement with public TOS that a legal or medical department is managing.
TeMPOraL · 3 years ago
At the very least, I'd start with adding support for the Azure flavor of OpenAI API. It's literally the same models, but the difference is that it's your company deploying those models on Azure, under proper enterprise contract with Microsoft, literally so that they can be safely used with proprietary data.
Weves · 3 years ago
Yea, that's good feedback - we've gotten requests for open source model support from a lot from people we've talked about. It's one of our highest priorities, and should be available soon!
adr1an · 3 years ago
Are you about to use GPT4ALL[^1] or anything else? If you're going with the second option, then please share any link to such resources... I'd be interested.

And, to share with you something: I saw somewhere a tool (maybe it was GPT4ALL itself) that had the ability to expose a OpenAI-compatible local API on localhost:8080... Ah, yes. Here it is. Actually, there are two. They are described as possible backends for Bavarder (that's a free access to multiple online models, API key is not required): https://bavarder.codeberg.page/help/local/

[^1]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-backen...

TeMPOraL · 3 years ago
From FAQ:

> Danswer provides Docker containers that you can run anywhere, so data never has to leave your PVC. The one exception is using GPT for inference but we are working on allowing for locally hosted generative models as well.

Look, you can plug this hole trivially for many companies, by adding support for Azure OpenAI API. It's almost identical to OpenAI API - the main difference is how you pass keys and specify the model to use. But that alone will make it possible to use Danswer with company data in places that signed a relevant contract with Microsoft.

Weves · 3 years ago
That's a good suggestion! Will look into it, should be fairly easy to support
lmeyerov · 3 years ago
How are you thinking about the "document level access control" to make this viable for business environments?

Ex: If a connected gdrive document gets indexed, but then someone fixes the share settings in google docs for some item to be more restrictive.. How does Danswer avoid leaking that data? Dynamic check before returning any doc that the live federated auth settings safelist the requesting user reading that doc?

Weves · 3 years ago
Great question! Right now, our access control is very basic. When admins setup connectors to other apps, all documents indexed are accessible by all (meant to be public documents only). Individual users can index private documents by providing their own access tokens for connectors, and those docs will be only available to the user who owns that access token. Improving this is a high priority item for us, as we understand this is a deal-breaker for enterprises.

The immediate plan is to extend our current poll / push based connectors to also grab access information (+ add IdP integrations for cross-app identity). There will be some delay to grab access updates, which will be combatted by the dynamic check with the app / IdP itself at query time that you mentioned (still investigating exactly how this will work).

We are also considering adding support for group based access defined within Danswer itself for sources that don't provide APIs to get access information (default being all-public if not specified). Of course, for these, we will not be able to sync permissions.

ttul · 3 years ago
I wonder how long it will be before Google Workspace just has this feature for your Docs. It can't be long... Question-answering against external docs is something Google could easily add. I worry about the defensibility of startups working in this area as it's so fully in front of the steamroller.
Solvency · 3 years ago
Google can't even make a half rate Bard.
andre-z · 3 years ago
We at Qdrant are glad to be a part of this awesome solution, providing the Vector Database resource for Danswer. https://github.com/qdrant/qdrant
Weves · 3 years ago
Amazing foundational tools like Qdrant make building in this space so much easier <3
darkteflon · 3 years ago
Could you say a few words about why you chose Qdrant for this project? It seems to me there is definitely a place in this space for a back-end focused on retrieval for LLMs that goes beyond simple vector similarity search and encapsulates other metadata creation / indexing and hybrid retrieval techniques to tackle the “false negatives” (missing context) challenge. We’re trying to decide whether leaning on something like Qdrant, Weaviate or Pinecone instead of our current Postgres / pgvector stack might be worth the cost of learning and running extra infra.
TommyCat · 3 years ago
Looks great and will test it out, but for enterprises definitely needs support for Azure/Office 365 integration to index Word, Excel, etc. Lots of docs are stored in Onedrive, Teams channels, and SharePoint. I'm going to test these use cases, but would be nice if it supports it OOB like Google Docs. Also, any thought of OOB connectors to ServiceNow or other ticketing/KB platforms?
Weves · 3 years ago
Native support for the Microsoft suite of tools is something we plan to add fairly soon! We're a small team, and currently swamped with connector/feature requests so no promises on the timeline.

Ticketing platforms like ServiceNow fall under a similar category, although a bit lower priority in my mind.