mtricot commented on Show HN: Airbyte 1.0, Marketplace, AI Assist, GenAI Support and Enterprise GA    · Posted by u/bleonard
strider_99 · a year ago
Four years to the date since 0.1.0! Nice work! (https://github.com/airbytehq/airbyte/releases/tag/v0.1.0-alp...)
mtricot · a year ago
Talk about a trip down memory lane :) The initial name of the project was "conduit"...
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
samspenc · 3 years ago
> - OpenAI: you can host an OSS model if you want to

Just to confirm: you mean models like Facebook's Llama 2 and variants right? Since OpenAI hasn't released any OSS models.

mtricot · 3 years ago
correct
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
everythingmeta · 3 years ago
nice to see a tutorial that recognizes the case where the underlying data can change and the embedding needs to be updated.

Any plans to write a tutorial for fine-tuning local models?

mtricot · 3 years ago
Not at the moment but let me bring that to the team so we can brainstorm what it could look like.
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
ramesh31 · 3 years ago
I don't know what datasets you guys are working with that have no issues being shared in plain text across three separate proprietary paid services, but this is a nonstarter for me.
mtricot · 3 years ago
The tutorial describes one stack for building a specific app, but that stack is made of building blocks that you can replace with others if you need to.

- Airbyte has two self-hosted options: OSS & Enterprise

- Langchain: OSS

- OpenAI: you can host an OSS model if you want to

- Pinecone: there are OSS/self-hosted alternatives
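To make the "replaceable building blocks" point concrete, here is a minimal, self-contained Python sketch (the toy embedder and in-memory store are invented stand-ins, not the tutorial's actual code): each layer sits behind a small interface, so any one piece — the embedder (OpenAI vs. a self-hosted model) or the vector store (Pinecone vs. an OSS alternative) — can be swapped without touching the rest of the app.

```python
# Sketch: each stack layer behind a small interface so components are swappable.
from dataclasses import dataclass, field
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


@dataclass
class ToyEmbedder:
    """Stand-in for OpenAI or a self-hosted model (illustrative only)."""

    def embed(self, text: str) -> list[float]:
        # Trivial vowel-frequency "embedding" just to make the sketch runnable.
        return [text.count(c) / max(len(text), 1) for c in "aeiou"]


@dataclass
class InMemoryVectorStore:
    """Stand-in for Pinecone or an OSS/self-hosted alternative."""

    rows: list[tuple[str, list[float]]] = field(default_factory=list)

    def upsert(self, doc: str, vec: list[float]) -> None:
        self.rows.append((doc, vec))

    def nearest(self, vec: list[float]) -> str:
        # Naive L2 nearest-neighbour search over all stored rows.
        return min(
            self.rows,
            key=lambda r: sum((a - b) ** 2 for a, b in zip(r[1], vec)),
        )[0]


def build_index(docs: list[str], embedder: Embedder, store: InMemoryVectorStore) -> None:
    # The app code only talks to the interfaces, never to a specific vendor.
    for doc in docs:
        store.upsert(doc, embedder.embed(doc))
```

Because `build_index` only depends on the `Embedder` protocol and the store's `upsert`, replacing either side is a one-line change at the call site.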

mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
croes · 3 years ago
Why is the OpenAI from the article title missing?
mtricot · 3 years ago
No good reason. Does "it made the post's title too long" work?
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
amelius · 3 years ago
I like to keep my tools simple so just give me a single AI that can do everything, browse through my data, generate pictures and give me suggestions in my code editor, etc. etc., instead of a different AI for every tool out there.
mtricot · 3 years ago
Isn't that the dream? Today a lot of stack needs to be built to enable what you're describing, and that is actually what we are doing with this post: figuring out what foundations we need so that the end-user UX is what you're describing. It will take some time to get there :)
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
gz5 · 3 years ago
Very well written and illustrated, thank you.

When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my enviro, and is there a VPN type option for private connectivity?

mtricot · 3 years ago
It depends.

Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.

For OSS & Enterprise, data doesn't leave your infra since Airbyte runs in your infrastructure. For Cloud, you would have to allowlist some Airbyte IPs so we can access your local db.

mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
r_thambapillai · 3 years ago
How are you thinking about preventing customer PII making it to OpenAI?
mtricot · 3 years ago
For the purpose of the tutorial that we built, it really comes down to the type of data that you're using.

If you have data with PII:

One option would be to use Airbyte to bring the data into files or a local db rather than directly into the vector store, add an extra step that strips all PII from the data, and then configure Airbyte to move the clean files/records to the vector store.

The option that jmorgan mentioned is also relevant here: using a "self-hosted" model.
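A rough sketch of that intermediate "strip PII" step (the regex patterns below are illustrative, not a complete PII detector): land the raw records locally first, redact obvious PII, and only hand the cleaned records to the embedding/vector-store step.

```python
# Sketch: redact obvious PII before records reach a hosted model or vector store.
import re

# Illustrative patterns only — real PII detection needs a dedicated tool.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "<PHONE>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]


def redact(record: str) -> str:
    # Apply each pattern in turn, replacing matches with a placeholder token.
    for pattern, placeholder in PII_PATTERNS:
        record = pattern.sub(placeholder, record)
    return record


def clean_records(raw: list[str]) -> list[str]:
    # Only this redacted output should ever leave your infrastructure.
    return [redact(r) for r in raw]
```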

mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
mritchie712 · 3 years ago
have you considered supporting pgvector? I'd imagine that'd be easier since you already have pg as a destination.
mtricot · 3 years ago
On the roadmap! We want to get more clarity on how the embedding step fits into the ELT model. Once we figure that out, we will add it to PG.
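For illustration only — pgvector support was still on the roadmap at the time, so this is a hypothetical sketch, not Airbyte's implementation — loading embeddings into a Postgres/pgvector destination could look roughly like this (table and column names are made up):

```python
# Hypothetical sketch of an "embed at load time" step targeting pgvector:
# the load writes each record plus its embedding into a table with a
# `vector` column. Assumes the pgvector extension is installed.

def to_pgvector_literal(vec: list[float]) -> str:
    # pgvector accepts vectors as '[x1,x2,...]' text literals.
    return "[" + ",".join(repr(v) for v in vec) + "]"


# Hypothetical schema; vector(3) would match the embedding dimensionality.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(3)
);
"""


def insert_stmt(body: str, vec: list[float]) -> tuple[str, tuple]:
    # Parameterized statement (e.g. for psycopg); the vector is passed as
    # a text literal and cast server-side.
    return (
        "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
        (body, to_pgvector_literal(vec)),
    )
```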
mtricot commented on Show HN: Chat with your data using LangChain, Pinecone, and Airbyte   airbyte.com/tutorials/cha... · Posted by u/mtricot
_pdp_ · 3 years ago
A fantastic starting point for beginners! Personally, I believe this tutorial provides a solid foundation, but there's so much more to explore. Building something truly effective involves tackling various nuanced situations and special cases.

While querying records in Pinecone can sometimes give you the right results, it can also be a bit unpredictable, depending on what and how you query. You might want to check out options like Weaviate, or even delve into the world of sparse indexes for an added layer of complexity.

The models themselves have their own quirks too. For example, GPT-3.5 Turbo tends to respond well when given clear instructions at the beginning of the context, while GPT-4, although more flexible, still comes with its own set of challenges.

Despite this, I'm genuinely excited about the push to highlight the potential of LLM applications (more of that, please!). Just remember, while tutorials like this are a great step, achieving seamless results might require some hands-on experience and learning along the way.
mtricot · 3 years ago
Thanks! I agree with your point. There is a lot of tuning that needs to happen, including context-aware splitting and other kinds of transformation before the unstructured data gets indexed. This is one of the big challenges of productionizing LLM apps with external data. So far we are using it internally, since the team has experience building these connectors, and it has become a great co-pilot.

The great thing we get by plugging this whole stack together is that we get all the refreshed data as more issues/connectors get created.
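As one toy example of the context-aware splitting mentioned above (not Airbyte's actual transformation): instead of cutting text every N characters, split on paragraph boundaries and pack paragraphs into chunks under a size budget, so each indexed chunk stays reasonably self-contained.

```python
# Sketch: paragraph-aware chunking before embedding/indexing.

def split_paragraph_aware(text: str, max_chars: int = 200) -> list[str]:
    chunks: list[str] = []
    current = ""
    # Split on blank lines so chunk boundaries fall between paragraphs.
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(current) + len(para) + 2 > max_chars:
            # Adding this paragraph would blow the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Real pipelines would also carry metadata (source document, section heading) alongside each chunk so retrieved context can be attributed.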

u/mtricot

Karma: 136 · Cake day: June 6, 2016

About: Co-Founder & CEO @ Airbyte