sagaro · 2 years ago
All these products that pitch using AI to find insights in your data always end up looking pretty in demos and falling short in reality. This is not because the product is bad, but because there is an enormous amount of nuance in DBs/tables that becomes difficult to manage. Most startups evolve too quickly, and product teams generally try to deliver by hacking some existing feature. Columns are added, some columns get new meanings, some feature is identified by looking at a combination of 2 columns, etc. All this needs to be documented properly and fed to the AI, and there is no incentive for anyone to do it. If the AI gives the right answer, everyone is like "wow, AI is so good, we don't need the BAs." If the AI gives terrible answers they are like "this is useless." No one goes "wow, the data engineering team did a great job keeping the AI relevant."
joshstrange · 2 years ago
I couldn’t agree more. I’ve hooked things up to my DB with AI in an attempt to “talk” to it, but the results have been lackluster. Sure, it’s impressive when it does get things right, but I found myself spending a bunch of time adding to the prompt to explain how the data is organized.

I’m not expecting any LLM to just understand it; heck, another human would need the same rundown from me. Maybe it’s worth keeping this “documentation” up to date, but my takeaway was that I couldn’t release access to the AI because it got things wrong too often and I couldn’t anticipate every question a user might ask. I didn’t want it to give out wrong answers (this DB is used for sales), since spitting out wrong numbers would be just as bad as my dashboards “lying”.

Demo DBs aren’t representative of shipping applications, and so the demos using AI are able to show an extremely high success rate. My DB, with deprecated columns, naming that is possibly confusing (to other people), etc., had a much higher error rate.

eurekin · 2 years ago
Speculating

How about a chat interface, where you correct the result and provide more contextual information about those columns?

Those chats could later be fed back to the model, with a DPO optimisation run on top.
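For illustration, a minimal sketch of how those correction chats could be turned into DPO preference pairs; the log field names here are hypothetical, and the trainer is only mentioned as one possible option:

```python
# Hypothetical sketch: build DPO preference pairs from logged chat corrections.
# Field names ("question", "generated_sql", "corrected_sql") are assumptions,
# not any particular product's schema.

def chats_to_dpo_pairs(chat_logs):
    pairs = []
    for chat in chat_logs:
        corrected = chat.get("corrected_sql")
        if corrected and corrected != chat["generated_sql"]:
            pairs.append({
                "prompt": chat["question"],        # the natural-language question
                "chosen": corrected,               # human-corrected SQL is preferred
                "rejected": chat["generated_sql"], # original model output is dispreferred
            })
    return pairs

# These pairs could then be handed to a DPO trainer (e.g. TRL's DPOTrainer)
# running on top of the base model.
```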

eek2121 · 2 years ago
Welcome to AI in general.

Billions wasted on a pointless endeavor.

10 years from now folks are going to be laughing at how billions of dollars and productivity was flushed down the drain to support Microsoft Word 2.0.

AI is a bubble. Do yourself a favor and short (or buy put options) the companies that only have "AI" for a business model.

Also short Intel, because Intel.

lmeyerov · 2 years ago
Our theory is that we are simultaneously having a bit of a Google moment and a Tableau moment. There is a lot more discovery & work to pull it off, but the dam has been broken. It's been an exciting time to work through with our customers:

* Google moment: AI can now watch and learn how you and your team do data. Around the time Google PageRank came out, the Yahoo-style search engines were highly curated, and the semantic web people were writing XML/RDF schemas and manually mapping all data to them. Google replaced slow and expensive work with something easier, higher quality, and more scalable + robust. We are making Louie.ai learn both ahead of time and as the system gets used, so data people can also get their Google moment. Having a tool that works with you & your team here is amazing.

* Tableau moment: A project or data owner can now guide a lot more without much work. Dashboarding used to require a lot of low-level custom web dev etc., while Tableau streamlined it so that a BI lead who was good at SQL and understood the data & design could go much further without a big team and in way less time. Understanding the user personas, and adding abstractions to facilitate them, was a big deal for delivery speed, cost, and achieved quality. Arguably the same happened when Looker introduced LookML and foreshadowed the whole semantic layer movement happening today. To help owners ensure quality and security, we have been investing a lot in the equivalent abstractions in Louie.ai for making data more conversational. Luckily, while the AI part is new, there is a lot more precedent on the data ops side. Getting this right is a big deal in team settings and basically any time the stakes are high.

scoot · 2 years ago
> Around the time Google pagerank came around, the Yahoo-style search engines were highly curated

Hmmm, no. Altavista was the go-to search engine at the time (launched 1995), and was a crawler (i.e. not a curated catalog/directory) based search. Lycos predates that but had keyword rather than natural language search.

Google didn't launch until 1998.

tucnak · 2 years ago
Is that right? You do all that at Louie.ai?
j-a-a-p · 2 years ago
Mostly agree. I suggest sticking with ETL and creating a data warehouse that irons out most of these nuances that a production database needs. On a data warehouse with good metadata, I can imagine this will work great.
sagaro · 2 years ago
I think getting clean tables/ETLs is a big blocker when you move fast and break things. I would be more interested in a GitHub Copilot-style SQL IDE (like DataGrip etc.), which has access to all the queries written by all the people within a company, and which runs on a local server or something for security reasons and to get the nod from the IT/Sec department.

And basically, when you next write queries, it just autocompletes for you. This would improve analyst productivity a lot, with the flexibility of them being able to tweak the query. If something is not right, the analyst updates it, and the Copilot AI keeps learning, giving more weight to recent queries than to older ones (a rough sketch of that recency weighting is below).

Unlike the previous solution where if something breaks, you can do nothing till you clean up the ETL and redeploy it.
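A rough sketch of that recency weighting, assuming the company's query log is just a list of timestamped SQL records; the half-life and scoring are illustrative, not any existing product's logic:

```python
import time

HALF_LIFE_DAYS = 90  # assumed decay rate; tune to taste

def recency_weight(ts: float, now: float) -> float:
    """Exponentially decay a query's weight with its age in days."""
    age_days = (now - ts) / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def rank_past_queries(query_log, keywords, top_k=5):
    """query_log: list of dicts like {"ts": unix_time, "sql": "SELECT ..."}.
    Returns the most relevant recent queries to feed the copilot as context."""
    now = time.time()
    scored = []
    for q in query_log:
        overlap = sum(kw.lower() in q["sql"].lower() for kw in keywords)
        if overlap:
            scored.append((overlap * recency_weight(q["ts"], now), q["sql"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sql for _, sql in scored[:top_k]]
```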

zurfer · 2 years ago
That is correct. GPT-4 is good on well-modelled data out of the box, but struggles with a messy and incomplete data model.

Documenting data definitely helps to close that gap.

However, the last part you describe is nothing new (BI teams taking credit and pushing problems onto data engineers). In fact, there is a chance that tools like vanna.ai or getdot.ai bring engineers closer to business folks. So more honest conversations, more impact, more budget.

Disclaimer: I am a co-founder at getdot.ai :)

lmeyerov · 2 years ago
Agreed, maybe I wasn't clear enough. I don't view it as BI team vs platform team vs whoever. Maybe a decrease in the need for PhD AI consultants for small projects, or in waiting on some privileged IT team for basic tasks, so they can focus on bigger things.

Instead of Herculean data infra projects, this is a good time for figuring out new policy abstractions, and finding more productive divisions of labor between different data stakeholders and systems. Machine-friendly abstractions and structure are tools for predictable collaboration and automation. More doing, less waiting.

More practically, an increasing part of the Louie.ai stack is helping get the time-consuming quality, guardrails, security, etc parts under easier control of small teams building things. As-is, it takes a lot to give a great experience.

sagaro · 2 years ago
There used to be a company/product called Business Objects aka BO (SAP bought them), which had folks meticulously map every relationship. When done correctly, it was pretty good. You could just drag drop and get answers immediately.

So yes, I can understand if there is incentive for the startups to invest in Data Engineers to make well maintained data models.

But I do think the most important value here is not the ChatGPT interface; it is getting DEs to maintain the data model in a company where product/biz is moving fast and breaking things. If that is done, then existing tools (Power BI, for instance, has an "ask in natural language" feature) will be able to get the job done.

The Google moment the other person talks about in another comment is that the Google of 1998 didn't require a webpage owner to do anything. They didn't need him/her to make something in a different format, use specific tags, put tags around keywords, etc. It was just "you do what you do, and magically we will crawl and make sense of it".

Here, unfortunately, that is not the case. Say an ecom business that always delivers in 2 days for free launches a new offering, same-day delivery for $5. The sales table is going to get two extra columns, "is_same_day_delivery_flag" and "same_day_delivery_fee". The revenue definition will change to include this shipping charge. A new filter will be needed if someone wants to see the opt-in rate for same-day delivery or how fast it is growing. The current table probably has revenue, but now revenue = revenue + same_day_delivery_fee, and someone needs to make the BO connection to this. And after launch, you notice you don't have enough capacity to do same-day shipping, so sometimes you just have to refund the fee and send it as normal delivery. Now is_same_day_delivery_flag is true, but same_day_delivery_fee is 0. And so on and on...
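To make that drift concrete, here is roughly what the metric definition looks like before and after such a launch, using the hypothetical columns above:

```python
# Hypothetical before/after of the revenue definition described above.

REVENUE_SQL_V1 = """
SELECT SUM(revenue) AS total_revenue
FROM sales
"""

# After the same-day-delivery launch: the fee folds into revenue, and the
# opt-in rate uses the flag (which can be true even when the fee was refunded).
REVENUE_SQL_V2 = """
SELECT SUM(revenue + COALESCE(same_day_delivery_fee, 0)) AS total_revenue,
       AVG(CASE WHEN is_same_day_delivery_flag THEN 1.0 ELSE 0.0 END) AS same_day_opt_in_rate
FROM sales
"""
```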

Getting DEs to keep everything up to date in a wiki is tough, let alone in a BO-type solution. But I do hope getdot.ai etc., or someone, incentivizes them to change this way of doing things.

anon291 · 2 years ago
The AI needs to truly be 'listening in' passively to all Slack messages, virtual meetings, code commits, etc., and really be present wherever the 'team' is, in order to get anything done.
XCSme · 2 years ago
Or maybe the database documentation has to be very comprehensive and the AI should have access to it.


bob1029 · 2 years ago
The most success I had with AI+SQL was when I started feeding errors from the SQL provider back to the LLM after each iteration.

I also had a formatted error message wrapper that would strongly suggest querying system tables to discover schema information.

These little tweaks made it scary good at finding queries, even ones requiring 4+ table joins. Even without any examples or fine tuning data.
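For what it's worth, a minimal sketch of that loop might look like the following; `call_llm` is a stand-in for whatever chat-completion wrapper is used, and the prompt wording is an assumption, not the parent's actual setup (SQLite is used just to keep it self-contained):

```python
import sqlite3

# Minimal sketch of the iterative error-feedback loop described above.
# call_llm() stands in for whatever LLM provider you use.

def call_llm(messages) -> str:
    raise NotImplementedError("wrap your LLM provider here")

def ask_with_retries(question: str, conn: sqlite3.Connection, max_iters: int = 5):
    messages = [
        {"role": "system", "content": (
            "You write SQLite queries. If you are unsure of the schema, "
            "first query sqlite_master or PRAGMA table_info to discover it."
        )},
        {"role": "user", "content": question},
    ]
    for _ in range(max_iters):
        sql = call_llm(messages)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as e:
            # Feed the provider's error back, nudging toward the system tables.
            messages.append({"role": "assistant", "content": sql})
            messages.append({"role": "user", "content": (
                f"That query failed with: {e}. "
                "Inspect the schema via the system tables and try again."
            )})
    raise RuntimeError("no valid query after retries")
```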

echelon · 2 years ago
Please turn this into a product. There's enormous demand for that.
bob1029 · 2 years ago
I feel like by the time I could turn it into a product, Microsoft & friends will release something that makes it look like a joke. If there is no one on the SQL Server team working on this right now, I don't know what the hell their leadership is thinking.

I am not chasing this rabbit. Someone else will almost certainly catch it first. For now, this is a fun toy I enjoy in my free time. The moment I try to make money with it the fun begins to disappear.

Broadly speaking, I do think this is approximately the only thing that matters once you realize you can put pretty much anything in a big SQL database. What happens when 100% of the domain is in-scope of an LLM that has iteratively optimized itself against the schema?

SOLAR_FIELDS · 2 years ago
There are already several products out there with varying success.

Some findings after I played with it awhile:

- LangChain already does something like this. A lot of the challenge is not with the query itself but with efficiently summarizing data to fit in the context window. In other words, if you give me 1-4 tables, I can give you a product that will work well pretty easily. But when your data warehouse has tens or hundreds of tables with columns and meta types, now we need to chain together a string of queries to arrive at the answer, and we are basically building a state machine of sorts that has to do fun and creative RAG stuff.

- The single biggest thing that made a difference in effectiveness was not what op mentioned at all, but instead having a good summary of what every column in the db was, stored in the db. This can be AI-generated itself, but the way LangChain attempts to do it on the fly is slow and rather ineffective (or at least that was the case when I played with it last summer; it might be better now).
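A rough sketch of the "column summaries stored in the db" idea; the `column_docs` table and the keyword scoring are hypothetical, just to show the shape of the retrieval step:

```python
import sqlite3

# Hypothetical metadata table holding one short summary per column; both the
# table and the keyword scoring below are illustrative, not a real product.
DDL = """
CREATE TABLE IF NOT EXISTS column_docs (
    table_name  TEXT,
    column_name TEXT,
    summary     TEXT
)
"""

def relevant_column_docs(conn: sqlite3.Connection, question: str, limit: int = 20):
    """Pull the column summaries most likely relevant to the question, to be
    put in the prompt instead of the full warehouse schema."""
    terms = [t for t in question.lower().split() if len(t) > 3]
    rows = conn.execute(
        "SELECT table_name, column_name, summary FROM column_docs"
    ).fetchall()
    scored = []
    for tbl, col, summary in rows:
        text = f"{tbl} {col} {summary}".lower()
        score = sum(term in text for term in terms)
        if score:
            scored.append((score, f"{tbl}.{col}: {summary}"))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]
```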

Not affiliated, but after reviewing the products out there the data team I was working with ended up selecting getdot.ai as it had the right mix of price, ease of use, and effectiveness.

petters · 2 years ago
It sounds like pretty standard constructions with OpenAI's API. I have a couple of such iterative scripts myself for bash commands, SQL etc.

But sure, why not!

l5870uoo9y · 2 years ago
You can check out https://www.sqlai.ai. It has AI-powered generators for:

- Generate SQL

- Generate optimized SQL

- Fix query

- Optimize query

- Explain query

Disclaimer: I am the solo developer behind it.

chrisjh · 2 years ago
Shameless plug – we're working on this at Velvet (https://usevelvet.com) and would love feedback. Our tool can connect and query across disparate data sources (databases and event-based systems) and allows you to write natural language questions that are turned automatically into SQL queries (and even make those queries into API endpoints you can call directly!). My email is in my HN profile if anyone wants to try it out or has feedback.
Kiro · 2 years ago
What made you say this? How is this different from the hundreds of AI startups already focusing on this, or even the submission that we're having this conversation on?
BenderV · 2 years ago
Shameless plug - https://github.com/BenderV/ada

It's an open source BI tool that does just that.

mcapodici · 2 years ago
I would be tempted to pivot to that! I am working on similar for CSS (see bio) but if that doesn’t work out my plan was to pivot to other languages.
teaearlgraycold · 2 years ago
Someone get YC on the phone
quickthrower2 · 2 years ago
Or open source? You could get 10k stars :-)
vivzkestrel · 2 years ago
Who are the most recently signed-up users, and what are their hashed passwords? What is stopping me from running this query on your database?
hot_gril · 2 years ago
Same thing stopping you from executing arbitrary SQL on the DB.
bongodongobob · 2 years ago
I'm really curious as to the reasoning behind your question and why you think an LLM generated query somehow would have unfettered access and permissions.
sigmoid10 · 2 years ago
What's stopping anyone who can run ordinary SQL queries? The LLM just simplifies interaction, it is neither the right tool nor the right place to enforce user rights.
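Concretely, the usual mitigation is to run the generated SQL under a locked-down, read-only role; a Postgres-flavored sketch with hypothetical schema and table names:

```python
# Postgres-flavored sketch (hypothetical role/schema/table names): the LLM's
# generated queries run only under this role, so tables holding credentials
# are simply not reachable, no matter what SQL comes back.
READ_ONLY_ROLE_SQL = """
CREATE ROLE llm_readonly NOLOGIN;
GRANT USAGE ON SCHEMA analytics TO llm_readonly;
GRANT SELECT ON analytics.orders, analytics.products TO llm_readonly;
-- deliberately no grants on the users/auth tables
"""
```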
aussieguy1234 · 2 years ago
I've already done this with GPT-4.

It goes something like this:

Here's the table structure from the MySQL CLI (`SHOW CREATE TABLE` statements) for the tables I want to query.

Now given those tables, give me a query to show me my cart abandonment rate (or, some other business metric I want to know).

Seems to work pretty well.
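A sketch of that flow, with an illustrative prompt; `call_llm` and the DDL file paths are placeholders, not real artifacts:

```python
# Sketch of the approach above: paste the DDL, then ask for the metric query.

def build_prompt(ddl_statements: list[str], metric_question: str) -> str:
    schema = "\n\n".join(ddl_statements)  # e.g. output of SHOW CREATE TABLE per table
    return (
        "Here is the table structure for the tables I want to query:\n\n"
        f"{schema}\n\n"
        f"Given those tables, give me a MySQL query to show me {metric_question}."
    )

# Hypothetical usage: DDL dumped to local files beforehand.
prompt = build_prompt(
    ddl_statements=[open("carts.ddl").read(), open("orders.ddl").read()],
    metric_question="my cart abandonment rate over the last 30 days",
)
# sql = call_llm(prompt)  # whatever GPT-4 wrapper you use
```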

zainhoda · 2 years ago
Author of the package here. That's pretty much what this package does, just with optimization around what context gets sent to the LLM about the database:

https://github.com/vanna-ai/vanna/blob/main/src/vanna/base/b...

And then of course, once you have the SQL, you can put it in an interface where the SQL can be run automatically and you get a chart, etc.
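That last run-and-chart step can be as small as this sketch (not taken from the vanna codebase; pandas and plotly are assumed to be available):

```python
import sqlite3

import pandas as pd
import plotly.express as px

# Generic "execute the generated SQL, then chart the result" step.

def run_and_chart(conn: sqlite3.Connection, sql: str, x: str, y: str) -> pd.DataFrame:
    df = pd.read_sql_query(sql, conn)  # run the LLM-generated query
    px.bar(df, x=x, y=y).show()        # chart type chosen arbitrarily here
    return df
```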

lopatin · 2 years ago
What surprises me is the number of programmers and analysts I meet that don’t do this yet. Writing complex, useful SQL queries is probably the most valuable thing ChatGPT does for me. It makes you look like a god to stakeholders, and “I’m not strong in SQL” is no longer an excuse for analysis tasks.
EZ-E · 2 years ago
Same for Excel: I suck at it, but ChatGPT is really good at writing these formulas. I feel this is one area LLMs are pretty good at.
zainhoda · 2 years ago
I've noticed this as well. Do you have any theories as to why that is?
kulikalov · 2 years ago
While I recognize the efforts in developing natural language to SQL translation systems, I remain skeptical. The core of my concern lies in the inherent nature of natural language and these models, which are approximative and lack precision. SQL databases, on the other hand, are built to handle precise, accurate information in most cases. Introducing an approximative layer, such as a language model, into a system that relies on precision could potentially create more problems than it solves, leading me to question the productivity of these endeavors in effectively addressing real-world needs.
thom · 2 years ago
The key here is to always have some structured intermediate layer that can allow people to evaluate what it is the system thinks they're asking, short of the full complexity of SQL. You'll need something like this for disambiguation anyway - are "Jon Favreau movies" movies starring or directed by Jon Favreau? Or do you mean the other Jon Favreau? Don't one-shot anything important end to end from language. Use language to guide users towards unambiguous and easy to reason about compositions of smaller abstractions.
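One concrete shape for that intermediate layer: have the model emit a small structured query plan that the user confirms (or corrects) before any SQL is generated. A hypothetical sketch, not any particular product's format:

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    """Hypothetical intermediate representation shown to the user before SQL."""
    entity: str                                        # e.g. "movies"
    filters: list[str] = field(default_factory=list)   # e.g. 'person = "Jon Favreau"'
    metric: str = "count"
    ambiguities: list[str] = field(default_factory=list)

plan = QueryPlan(
    entity="movies",
    filters=['person = "Jon Favreau"'],
    ambiguities=[
        "Movies *starring* or *directed by* Jon Favreau?",
        "There are two people named Jon Favreau; which one?",
    ],
)
# The UI renders `plan` back to the user; only a confirmed, disambiguated plan
# gets compiled into SQL.
```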
totalhack · 2 years ago
I think AI to direct SQL will have use cases, but I personally have found it more of a fun toy than anything close to the main way I would interact with my data, at least in the course of running a business.

I am a fan of a semantic layer (I made an open source one but there are others out there), and think having AI talk to a semantic layer has potential. Though TBH a good UI over a semantic layer is so easy it makes the natural language approach moot.

https://github.com/totalhack/zillion

pietz · 2 years ago
I think you have a point. I also think people who share this mindset and keep sticking with it will have a tough future.

The technical benefits and potentials clearly outshine the problems and challenges. That's also true in this example. You just have to let go of some principles that were helpful in the past but aren't anymore.

kulikalov · 2 years ago
What technical benefits?
samstave · 2 years ago
Reverse idea:

Use this to POPULATE SQL based on captured NLP "surveillance" -- for example, build a DB of things I say as my thing listens to me, and categorize things, topics, places, people, etc. mentioned.

Keep count of experiencing the same things....

When I say I need to "buy thing" build table of frequency for "buy thing" etc...

Effectively - query anything you've said to Alexa and be able to map behaviors/habits/people/things...

If I say "Bob's phone number is BLAH", it adds bob+# to my "random people I met today" table with a note of "we met at the dog park".

Narrate yourself into something journaled.

Makes it easy to name a trip "Hike Mount Tam" then log all that - then have it create a link in the table to the pics folder on your SpaceDrive9000.ai and then you have a full narration with links to the pics that you take.

kulikalov · 2 years ago
You are describing a use case for a vector database.
codegeek · 2 years ago
I have been keeping track of a few products like these including some that are YC backed. Interesting space as I am looking for a solution myself:

- Minds DB (YC W20) https://github.com/mindsdb/mindsdb

- Buster (YC W24) https://buster.so

- DB Pilot https://dbpilot.io

and now this one

refset · 2 years ago
It's not a public-facing product, but there was a talk from a team at Alibaba a couple of months ago during CMU's "ML⇄DB Seminar Series" [0] on how they augmented their NL2SQL transformer model with "Semantics Correction [...] a post-processing routine, which checks the initially generated SQL queries by applying rules to identify and correct semantic errors" [1]. It will be interesting to see whether VC-backed teams can keep up with the state of the art coming out of BigCorps. (A toy version of such a rule check is sketched after the links.)

[0] "Alibaba: Domain Knowledge Augmented AI for Databases (Jian Tan)" - https://www.youtube.com/watch?v=dsgHthzROj4&list=PLSE8ODhjZX...

[1] "CatSQL: Towards Real World Natural Language to SQL Applications" - https://www.vldb.org/pvldb/vol16/p1534-fu.pdf

ignoramous · 2 years ago
See also SQLCoder by defog.ai: https://github.com/defog-ai/sqlcoder
lmeyerov · 2 years ago
We have been piloting louie.ai with some fairly heavy orgs that may be relevant: Cybersecurity incident responders, natural disaster management, insurance fraud, and starting more regular commercial analytics (click streams, ...)

A bit unusual compared to the above: we find operational teams need more than just SQL, but also Python and more operational DBs (Splunk, OpenSearch, graph DBs, Databricks, ...). Likewise, due to our existing community there, we invest a lot more in data viz (GPU, ...) and AI + graph workflows. These come up both through direct use, like Python notebooks & interactive dashboards (except code is opt-in, where desired or for checking the AI's work), and through new, embedded use for building custom apps and dashboards that embed conversational analytics.

kszucs · 2 years ago
Please add Ibis Birdbrain https://ibis-project.github.io/ibis-birdbrain/ to the list. Birdbrain is an AI-powered data bot, built on Ibis and Marvin, supporting more than 18 database backends.

See https://github.com/ibis-project/ibis and https://ibis-project.org for more details.

codyvoda · 2 years ago
note that Ibis Birdbrain is very much work-in-progress, but should provide an open-source solution to do this w/ 20+ backends

old demo here: https://gist.github.com/lostmygithubaccount/08ddf29898732101...

planning to finish it...soon...

zurfer · 2 years ago
I would love to also bring your attention to getdot.ai. We launched it on Hacker News with an analysis of HN post data: https://news.ycombinator.com/item?id=38709172

The main problems we see in the space: 1) good interface design: nobody wants another webapp if they can use Slack or Teams; 2) learning enough about the business and the usually messy data model to always give correct answers, or say "I don't know."

bredren · 2 years ago
Have you written up any results of your experience with each?

I’m interested in a survey of this field so far and would read it.

codegeek · 2 years ago
Not yet but not a bad idea if I can get to test them all soon :)
pylua · 2 years ago
I don’t fully understand the business use case after reading the documentation. Is it really a time saver?
MattGaiser · 2 years ago
It would be for people who are not that fluent in SQL. Even as a dev, I find ChatGPT to be easier for writing queries than hand coding them as I do it so infrequently.
EmilStenstrom · 2 years ago
Allow people that don't know SQL to query a database.


markding00 · 2 years ago
also Brewit.ai (YC W23) https://brewit.ai/
pamelafox · 2 years ago
I love that this exists, but I worry about how it uses the term “train”, even in quotes, as I spend a lot of time explaining how RAG works and trying to emphasize that there is no training/fine-tuning involved. Just data preparation, chunking, and vectorization as needed.
zainhoda · 2 years ago
Author of the package here. I would be open to alternative suggestions on terminology! The problem is that our typical user has never encountered RAG before.
pamelafox · 2 years ago
Hm we typically talk about that phase as “data ingestion”. We also use verbs extract, chunk, index, but those are parts of ingestion. Another phrase we use is “data preparation”. The tricky thing is that your train() method is on the vanna object, so if you called it “ingest”, it wouldn’t be clear where the data was headed. But I think it’d be clearer than “train” given that has such a specific meaning in this space. You could also look to LlamaIndex API for naming inspiration.
zacmps · 2 years ago
I've been using 'build' as in, 'build a dataset'. I think it gets the same idea across without being as easy to confuse with fine-tuning.
peheje · 2 years ago
Many of these AI "products": are they just feeding text into LLMs in a structured manner?
wruza · 2 years ago
Why the quotes? Your file manager is just feeding readdir into a tableViewDataSource, and your video player is just feeding video files into ffmpeg and then connecting its output to a frameBuffer control. Even your restaurant is just feeding vegetables into a dish. Most products "just" do something with existing technologies to remove the tedious parts.
okwhateverdude · 2 years ago
Basically, yeah. It is shockingly trivial to do, and yet like playing with alchemy when it comes to the prompting, especially if doing inference on the cheap by running smaller models. They can get distracted by your formatting, ordering, CAPITALIZATION, etc.
jedberg · 2 years ago
This is awesome. It's a quick, turnkey way to get started with RAG using your own existing SQL database, which to be honest is what most people really want when they say they "want ChatGPT for their business".

They just want a way to ask questions in prose and get an answer back, and this gets them a long way there.

Very cool!

zainhoda · 2 years ago
Author of the package here. Thank you! That's exactly what we've seen. Businesses spent millions of dollars getting all their structured data into a data warehouse and then it just sits there because the people who need the information don't know how to query the database.