Posted by u/lennertjansen 9 months ago
Show HN: Airweave – Let agents search any app (github.com/airweave-ai/airweave)
Hey HN, we're Lennert and Rauf. We're building Airweave (https://github.com/airweave-ai/airweave), an open-source tool that lets agents search and retrieve data from any app or database. Here's a general intro: https://www.youtube.com/watch?v=EFI-7SYGQ48, and here's a longer one that shows more real-world use cases, including examples of how Airweave is used by Cursor (0:33) and Claude Desktop (2:04): https://youtu.be/p2dl-39HwQo

A couple of months ago we were building agents that interacted with different apps, and we were frustrated when they struggled with vague natural-language requests like "resolve that one Linear issue about missing auth configs", "if you get an email from an unsatisfied customer, reimburse their payment in Stripe", or "what were the returns for Q1 based on the financials sheet in gdrive?". The agent would inefficiently chain together loads of function calls to find the data, or fail to find it at all and hallucinate.

We also noticed that even as the rise of MCP created more demand for agents that interact with external resources, most agent dev tooling focused on function calling and actions rather than search. Annoyed by the lack of tooling that let agents semantically search workspace or database contents, we started building Airweave as an internal solution. After getting positive reactions from coworkers and other agent builders, we decided to open-source it and pursue it full time.

Airweave connects to productivity tools, databases, or document stores via their APIs and transforms their contents into searchable knowledge bases, accessible through a standardized interface for the agent. The search interface is exposed via REST or MCP. When using MCP, Airweave essentially builds a semantically searchable MCP server on top of the resource. The platform handles the entire data pipeline from connection and extraction to chunking, embedding, and serving. To ensure knowledge is current, it has automated sync capabilities, with configurable schedules and change detection through content hashing.
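
To make that concrete, here's a simplified sketch of the idea behind that pipeline (illustrative Python, not our actual code; the chunking and hashing details are stand-ins):

    import hashlib

    def content_hash(text: str) -> str:
        """Fingerprint used for change detection between syncs."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def chunk(text: str, size: int = 1000) -> list[str]:
        """Naive fixed-size chunking; the real pipeline splits on structure."""
        return [text[i:i + size] for i in range(0, len(text), size)]

    def sync(records, embed, vector_store, seen_hashes):
        """One pass of connect -> extract -> chunk -> embed -> serve."""
        for record in records:                  # extracted via the source's API
            h = content_hash(record["content"])
            if seen_hashes.get(record["id"]) == h:
                continue                        # unchanged since last sync
            for i, piece in enumerate(chunk(record["content"])):
                vector_store.upsert(
                    id=f"{record['id']}:{i}",
                    vector=embed(piece),        # any embedding model
                    payload={"source": record["source"], "text": piece},
                )
            seen_hashes[record["id"]] = h       # remember the fingerprint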

We built it with support for white-labeled multi-tenancy to provide OAuth2-based integration across multiple user accounts while maintaining privacy and security boundaries. We're also actively working on permission-awareness (i.e., RBAC on the data) for the platform.
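
To give a rough picture of the tenant isolation, here's a hypothetical sketch (the class and field names are made up for illustration, not our actual data model):

    from dataclasses import dataclass

    @dataclass
    class TenantConnection:
        tenant_id: str          # your end user's account
        source: str             # e.g. "slack", "notion"
        access_token: str       # obtained via the OAuth2 consent flow

    class ConnectionStore:
        """Keeps each tenant's credentials and synced data separate."""
        def __init__(self):
            self._by_tenant: dict[str, list[TenantConnection]] = {}

        def add(self, conn: TenantConnection) -> None:
            self._by_tenant.setdefault(conn.tenant_id, []).append(conn)

        def for_tenant(self, tenant_id: str) -> list[TenantConnection]:
            # A search scoped to one tenant only ever touches that
            # tenant's connections -- the privacy/security boundary.
            return self._by_tenant.get(tenant_id, [])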

Happy to share learnings and hear about your experiences. Looking forward to your comments!

swyx · 9 months ago
I think before you guys spend the next two years building this startup, you should carefully study the connector business and the many carcasses along the way. YC itself has a few. It is probably one of the sloggiest businesses I know, and while the success cases like Fivetran are great, there is a lot of pain behind the failures. Don't ask how I know. Good luck, and I hope you prove me wrong if you choose to ignore this.

topaztee · 9 months ago
Why is it one of the sloggiest businesses?
smattiso · 9 months ago
This is a great idea. I have a question:

Typically speaking, an LLM is the code driving the control flow and the MCP servers are kind of dumb API endpoints (find_flights, search_hotels, etc.) for, say, a travel MCP.

With your product, how is the LLM made aware of the underlying data store in a more useful way than “func search(query)”?

It seems to me that if you could expose some precomputed API structure into the MCP for a given data store, then the LLM could reason more effectively about the data rather than throwing search queries into the void and hoping for the best?

behindsight · 9 months ago
From what I have gathered, their main differentiator is assigning each discrete data point its own "entity" definition that is independent but can be extended for each data provider.

So since it's all represented as entities, you can treat them like any other vectorised data in your vector store and use vector search.

It's a nice technique, but probably tricky if they ever venture into encapsulating endpoints in real time for rapidly changing B2C applications (rate limits, cron-job latency).
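
Roughly how I picture it (purely my guess at the shape, not their actual schema):

    from dataclasses import dataclass

    @dataclass
    class Entity:
        """Base unit every connector normalizes its data into."""
        entity_id: str
        source: str             # which provider it came from
        content: str            # the text that gets embedded

    @dataclass
    class LinearIssueEntity(Entity):
        """Provider-specific extension of the base entity."""
        status: str = "open"
        assignee: str | None = None

Once everything is normalized into an entity like this, the same embed-and-search path works no matter which provider the data came from.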

valianter · 9 months ago
Is chat always the best interface for all of these apps? I feel like search is the natural first step, but chat-based search has been around for a while. An MCP-based version of Glean/Onyx/Moveworks/Dashworks is interesting, but I'm unsure how much better it makes the product. Curious to see why your product is better.
raufakdemir · 9 months ago
Co-founder here. The Airweave interface doesn't discriminate between downstream use cases. Actually, most developers today don't build on it for a chat interface at all; instead they fold it into their agents to give them access to user data. At first sight it looks a lot like enterprise search, but it's really a building block for developers setting up integrations for their internal agents or agent products.
throwaway314155 · 9 months ago
Are integrations hooked in via each provider's MCP implementation? Or are you hooking in more traditionally and then exposing MCP on top of that?

Also, are these one-time/event-based syncs well supported by the integration providers? I know, for instance, that Discord (and I assume others like Slack) frowns upon that sort of wholesale archival/syncing of entire chat rooms, presumably due to security concerns and to maintain their data moats.

Finally (I think), do you have to write custom "diff" logic for each integration in order to maintain up-to-date retrieval for each one? I assume it would be challenging to keep this accurate and well structured across so many different integration providers. Is there something I'm missing that makes keeping a local backup of your data easier for each service?

All in all, looks very cool. Have starred the repo to mess around with tonight.

raufakdemir · 9 months ago
Good questions.

1) The integrations are done traditionally, i.e. with REST/SQL. The MCP/REST search layer sits on top of the data that gets synced.

2) Most providers are painless. Slack doesn't want major exports in one go, but most developers point at a single channel anyway, so the rate-limit errors don't bite too much.

3) This is all orchestrated by the platform itself. Incremental syncs receive the latest "watermark state" and sync from there. Hashes are used to compare data and decide the persist action (update/insert/keep).
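
In simplified form, the gist of that sync loop looks like this (illustrative only; the real orchestration is more involved):

    import hashlib
    from datetime import datetime

    def incremental_sync(source, store, watermark: datetime) -> datetime:
        latest = watermark
        # Only fetch records modified since the previous watermark.
        for record in source.changed_since(watermark):
            h = hashlib.sha256(record.content.encode()).hexdigest()
            old = store.get_hash(record.id)
            if old is None:
                store.insert(record, h)     # new record
            elif old != h:
                store.update(record, h)     # content changed
            # else: hashes match -> keep, nothing to do
            latest = max(latest, record.modified_at)
        return latest                       # becomes the next run's watermark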

alephnan · 9 months ago
Airweave is a large and established mattress brand that’s sold in department stores in Japan. You should look into that.
renesultan · 9 months ago
Had meetings with a ton of MCP-server providers; no one came close to Airweave’s retrieval accuracy. I even tried Zapier and similar large companies, and they didn’t come near Airweave. Highly, highly recommend if you need third-party integrations for your AI agents or workflows. Love the team too: cracked, cool, kind, and always there to support their customers (they even took a customer’s dog on a walk when the customer couldn’t, lol).
risyachka · 9 months ago
Noob here - why would MCP providers have good accuracy?

Don’t they just adapt existing APIs to the MCP protocol, basically just wrapping them?

raufakdemir · 9 months ago
This is exactly the reason we started building Airweave! The “context” in MCP is a bit misleading, as it actually provides very little context.
thawab · 9 months ago
Yes, a lot of MCP servers are just API wrappers. Airweave looks like it copies the data and runs RAG over it to process your queries.
ayxliu · 9 months ago
I was looking everywhere for a solution like this. Finally! Curious: do you guys integrate with internal data sources within a company?
brene · 9 months ago
Pretty cool stuff. How does it deal with self-hosted data sources? Can it run inside a VPC and talk to my RDS instances directly?
raufakdemir · 9 months ago
You can self-host Airweave on Docker or Kubernetes within your VPC. We eventually want to move towards AWS/Azure/GCP marketplace offerings that should make this easier. RDS should work, as long as you get an instance that speaks a Postgres/MySQL dialect.