mtricot commented on Show HN: Airbyte 1.0, Marketplace, AI Assist, GenAI Support and Enterprise GA    · Posted by u/bleonard
strider_99 · a year ago
Four years to the date since 0.1.0! Nice work! (https://github.com/airbytehq/airbyte/releases/tag/v0.1.0-alp...)
mtricot · a year ago
Talk about a trip down memory lane :) The initial name of the project was "conduit"...
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
samspenc · 3 years ago
> - OpenAI: you can host an OSS model if you want to

Just to confirm: you mean models like Facebook's Llama 2 and variants right? Since OpenAI hasn't released any OSS models.

mtricot · 3 years ago
correct
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
everythingmeta · 3 years ago
nice to see a tutorial that recognizes the case where the underlying data can change and the embedding needs to be updated.

Any plans to write a tutorial for fine-tuning local models?

mtricot · 3 years ago
Not at the moment but let me bring that to the team so we can brainstorm what it could look like.
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
ramesh31 · 3 years ago
I don't know what datasets you guys are working with that have no issues being shared in plain text across three separate proprietary paid services, but this is a nonstarter for me.
mtricot · 3 years ago
The tutorial describes one stack for building a specific app, but that stack is made of building blocks that you can replace with others if you need to.

- Airbyte has two self-hosted options: OSS & Enterprise

- Langchain: OSS

- OpenAI: you can host an OSS model if you want to

- Pinecone: there are OSS/self-hosted alternatives
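To make the "replaceable building blocks" point concrete, here is a minimal, self-contained Python sketch (the toy embedder and in-memory store are invented stand-ins, not the tutorial's actual code): each layer sits behind a small interface, so any one piece — the embedder (OpenAI vs. a self-hosted model) or the vector store (Pinecone vs. an OSS alternative) — can be swapped without touching the rest of the app.

```python
# Sketch: each stack layer behind a small interface so components are swappable.
from dataclasses import dataclass, field
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


@dataclass
class ToyEmbedder:
    """Stand-in for OpenAI or a self-hosted model (illustrative only)."""

    def embed(self, text: str) -> list[float]:
        # Trivial vowel-frequency "embedding" just to make the sketch runnable.
        return [text.count(c) / max(len(text), 1) for c in "aeiou"]


@dataclass
class InMemoryVectorStore:
    """Stand-in for Pinecone or an OSS/self-hosted alternative."""

    rows: list[tuple[str, list[float]]] = field(default_factory=list)

    def upsert(self, doc: str, vec: list[float]) -> None:
        self.rows.append((doc, vec))

    def nearest(self, vec: list[float]) -> str:
        # Naive L2 nearest-neighbour search over all stored rows.
        return min(
            self.rows,
            key=lambda r: sum((a - b) ** 2 for a, b in zip(r[1], vec)),
        )[0]


def build_index(docs: list[str], embedder: Embedder, store: InMemoryVectorStore) -> None:
    # The app code only talks to the interfaces, never to a specific vendor.
    for doc in docs:
        store.upsert(doc, embedder.embed(doc))
```

Because `build_index` only depends on the `Embedder` protocol and the store's `upsert`, replacing either side is a one-line change at the call site.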

mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
croes · 3 years ago
Why is the OpenAI from the article title missing?
mtricot · 3 years ago
No good reason. Does "it made the post's title too long" work?
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
amelius · 3 years ago
I like to keep my tools simple so just give me a single AI that can do everything, browse through my data, generate pictures and give me suggestions in my code editor, etc. etc., instead of a different AI for every tool out there.
mtricot · 3 years ago
Isn't that the dream? Today a lot of stack needs to be built to enable what you're describing, and that is actually what we are doing with this post: figuring out what foundations we need so that the end-user UX is what you're describing. It will take some time to get there :)
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
gz5 · 3 years ago
Very well written and illustrated, thank you.

When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my enviro, and is there a VPN type option for private connectivity?

mtricot · 3 years ago
It depends.

Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.

For OSS & Enterprise, data doesn't leave your infra since Airbyte runs in your infrastructure. For Cloud, you would have to allowlist some Airbyte IPs so we can access your local db.

mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
r_thambapillai · 3 years ago
How are you thinking about preventing customer PII making it to OpenAI?
mtricot · 3 years ago
For the purpose of the tutorial that we built, it really comes down to the type of data that you're using.

If you have data with PII:

One option would be to use Airbyte to bring the data into files or a local db rather than directly into the vector store, add an extra step that strips all PII from the data, and then configure Airbyte to move the clean files/records to the vector store.

The option that jmorgan mentioned is also relevant here: using a "self-hosted" model.
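A rough sketch of that intermediate "strip PII" step (the regex patterns below are illustrative, not a complete PII detector): land the raw records locally first, redact obvious PII, and only hand the cleaned records to the embedding/vector-store step.

```python
# Sketch: redact obvious PII before records reach a hosted model or vector store.
import re

# Illustrative patterns only — real PII detection needs a dedicated tool.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "<PHONE>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]


def redact(record: str) -> str:
    # Apply each pattern in turn, replacing matches with a placeholder token.
    for pattern, placeholder in PII_PATTERNS:
        record = pattern.sub(placeholder, record)
    return record


def clean_records(raw: list[str]) -> list[str]:
    # Only this redacted output should ever leave your infrastructure.
    return [redact(r) for r in raw]
```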

mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
mritchie712 · 3 years ago
have you considered supporting pgvector? I'd imagine that'd be easier since you already have pg as a destination.
mtricot · 3 years ago
On the roadmap! We want to get more clarity on how the embedding step fits into the ELT model. Once we figure that out, we will add it to PG.
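For illustration only — pgvector support was still on the roadmap at the time, so this is a hypothetical sketch, not Airbyte's implementation — loading embeddings into a Postgres/pgvector destination could look roughly like this (table and column names are made up):

```python
# Hypothetical sketch of an "embed at load time" step targeting pgvector:
# the load writes each record plus its embedding into a table with a
# `vector` column. Assumes the pgvector extension is installed.

def to_pgvector_literal(vec: list[float]) -> str:
    # pgvector accepts vectors as '[x1,x2,...]' text literals.
    return "[" + ",".join(repr(v) for v in vec) + "]"


# Hypothetical schema; vector(3) would match the embedding dimensionality.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(3)
);
"""


def insert_stmt(body: str, vec: list[float]) -> tuple[str, tuple]:
    # Parameterized statement (e.g. for psycopg); the vector is passed as
    # a text literal and cast server-side.
    return (
        "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
        (body, to_pgvector_literal(vec)),
    )
```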
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
_pdp_ · 3 years ago
A fantastic starting point for beginners! Personally, I believe this tutorial provides a solid foundation, but there's so much more to explore. Building something truly effective involves tackling various nuanced situations and special cases.

While querying records in Pinecone can sometimes give you the right results, it can also be a bit unpredictable, depending on what and how you query. You might want to check out options like Weaviate, or even delve into the world of sparse indexes for an added layer of complexity.

The models themselves have their own quirks too. For example, GPT-3.5 Turbo tends to respond well when given clear instructions at the beginning of the context, while GPT-4, although more flexible, still comes with its own set of challenges.

Despite this, I'm genuinely excited about the push to highlight the potential of LLM applications (more of that, please!). Just remember, while tutorials like this are a great step, achieving seamless results might require some hands-on experience and learning along the way.
mtricot · 3 years ago
Thanks! I agree with your point. There is a lot of tuning that needs to happen, including context-aware splitting and other kinds of transformation before the unstructured data gets indexed. This is one of the big challenges of productionizing LLM apps with external data. So far we are using it internally, since the team has experience building these connectors, and it has become a great co-pilot.

The great thing we get by plugging this whole stack together is that we get all the refreshed data as more issues/connectors get created.
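As one toy example of the context-aware splitting mentioned above (not Airbyte's actual transformation): instead of cutting text every N characters, split on paragraph boundaries and pack paragraphs into chunks under a size budget, so each indexed chunk stays reasonably self-contained.

```python
# Sketch: paragraph-aware chunking before embedding/indexing.

def split_paragraph_aware(text: str, max_chars: int = 200) -> list[str]:
    chunks: list[str] = []
    current = ""
    # Split on blank lines so chunk boundaries fall between paragraphs.
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(current) + len(para) + 2 > max_chars:
            # Adding this paragraph would blow the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Real pipelines would also carry metadata (source document, section heading) alongside each chunk so retrieved context can be attributed.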

u/mtricot

Karma: 136 · Cake day: June 6, 2016

About: Co-Founder & CEO @ Airbyte