bob1029 · 6 months ago
> This abstraction shields users from the complexity of the underlying systems and allows us to add new data sources without changing the user experience.

Cursed mission. These sorts of things do work amazingly well for toy problem domains. But once you get into more complex business logic involving 4-way+ joins, things go sideways fast.

I think it might be possible to have a human in the loop during the SQL authoring phase, but there's no way you can do it cleanly without outside interaction in all cases.

95% correct might sound amazing at first, but in practice it might as well be 0%. You need to be perfectly correct when working with data in bulk via SQL operations.

ozim · 6 months ago
A bit of a joke here:

It solves 100% of the cases where some manager requests a dashboard, never to look at it again after day one.

aargh_aargh · 6 months ago
There's no smoke without fire.
re-thc · 6 months ago
> A bit of a joke here:

> It solves 100% of the cases where some manager requests a dashboard, never to look at it again after day one.

Better than having engineers build it for days (for the same result).

mritchie712 · 6 months ago
Using a semantic layer is the cleanest way to have a human in the loop. A human can validate and create all important metrics (e.g. what does "monthly active users" really mean?), and then an LLM can use that metric definition whenever it's asked for MAU.

With a semantic layer, you get the added benefit of writing queries in JSON instead of raw SQL. LLMs are much more consistent at writing a small JSON payload than hundreds of lines of SQL.
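
For example, here's roughly what the LLM emits and how the semantic layer runs it (a sketch using Cube's REST query format, since that's what we use; the measure/dimension names and host are made up):

    import json
    import requests  # assumes the requests package is installed

    # Hypothetical query object: the LLM only has to produce this small JSON;
    # the semantic layer expands it into the actual SQL.
    query = {
        "measures": ["Users.monthlyActiveCount"],
        "timeDimensions": [{
            "dimension": "Users.lastSeenAt",
            "granularity": "month",
            "dateRange": "last 6 months",
        }],
    }

    resp = requests.get(
        "https://your-cube-host/cubejs-api/v1/load",       # placeholder host
        headers={"Authorization": "YOUR_CUBE_API_TOKEN"},  # placeholder token
        params={"query": json.dumps(query)},
    )
    print(resp.json()["data"])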

We[0] use cube[1] for this. It's the best open source semantic layer, but there are a couple of closed-source options too.

0 - https://www.definite.app/

1 - https://cube.dev/

zerop · 6 months ago
Currently exploring cube for a "natural language to SQL" solution.

My schema is 90+ tables and 2,500+ columns, well documented.

From your experience, does Cube look like a fit? My use cases will definitely involve JOINs.

efromvt · 6 months ago
JSON generation against a semantic layer plus validation loops is definitely the easiest way to get high-9s success rates going directly from text to a correct query. For the human-in-the-loop cases, going directly to SQL can be fun - I've toyed with a SQL-like semantic layer that removes the need for direct table access and joins, eliminating two of the risk points for LLMs going off the rails while still leveraging a lot of the baked-in knowledge of SQL syntax (window functions, transformations, etc.) that can be hard to exhaustively bake into a semantic layer. (It's annoying when the semantic layer doesn't quite have what you need.)
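
To give a flavor, a join-free, SQL-like query against such a layer might look like this (a hypothetical sketch; "orders" and its fields are invented, pre-modeled entities):

    # The LLM never sees raw tables or writes joins, but still gets to use
    # familiar SQL constructs like window functions.
    query = """
    SELECT
        order_month,
        revenue,
        LAG(revenue) OVER (ORDER BY order_month) AS prev_month_revenue
    FROM orders
    """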
tillvz · 6 months ago
Completely agree! A semantic layer is essential for scaling analytics to enterprise complexity.

Another alternative here is Veezoo [0], which combines the semantic layer (a Knowledge Graph) and self-service analytics into one integrated solution.

We built it specifically for the analytics use case: the "data persona" manages the semantic layer, and the "business persona" analyzes the data.

If you're looking for a semantic layer + (embedded) BI solution right out of the box, this could be a fit.

0 - https://www.veezoo.com

pbronez · 6 months ago
How well is that working for you?

We use a pattern where we ETL things into tables that model the upstream source closely, then use SQL Views to tighten up the model and integrate across data sources where needed. Keeping this all inside one DB allows us to use tools that understand the schema for autocomplete, etc.
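
Roughly, the views part of the pattern looks like this (a sketch with made-up table and column names):

    # Raw tables mirror the upstream sources; a view tightens the model and
    # integrates across sources for downstream tools.
    CREATE_ORDERS_VIEW = """
    CREATE VIEW analytics.orders AS
    SELECT
        o.id               AS order_id,
        o.created_at::date AS order_date,
        c.email            AS customer_email
    FROM raw_shopify.orders o
    JOIN raw_crm.customers c
      ON c.external_id = o.customer_ref
    """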

I expect the developer experience would be significantly worse if we started writing views in YAML instead of SQL… but you’ve found the opposite?

spolu · 6 months ago
Hi, you are right that things can go sideways fast. In practice, though, the data that the typical employee needs is also quite simple. So there is definitely a very nice fit for this kind of product with a large number of use cases, where we do see it provide a lot of value internally, both for employees (self-serve access to data) and for data scientists (reduced load).

For complex queries/use cases, we instead push our users to create agents that assist them in shaping SQL directly, rather than going straight from text to results/graphs. That pushes them to think more about correctness while still saving them a ton of time (the agent has access to the table schemas etc.), but it's not a good fit for non-technical people, of course.

abirch · 6 months ago
This works well until it doesn't - as long as there is someone responsible for data correctness. E.g. the cardinality of two joined tables needs to be deliberately maintained, rather than relying on the fact that there's currently no one in the system with two locations in the employee_to_location table, so the query works right now. Once that happens, the query will return the wrong employee count.
ritz_labringue · 6 months ago
It does require writing good instructions for the LLM to properly use the tables, and it works best if you carefully pick the tables that your agent is allowed to use beforehand. We have many users who use it for everyday work with real data (definitely not toy problems).
iLoveOncall · 6 months ago
If only we had a language to accurately describe what we want to retrieve from the database! Alas, one can only dream!
troupo · 6 months ago
> It does require writing good instructions for the LLM to properly use the tables

--- start quote ---

prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:

- training set

- weights

- constraints on the model

- layers between you and the model that transform both your input and the model's output that can change at any time

- availability of compute for your specific query

- and definitely some more details I haven't thought of

https://dmitriid.com/prompting-llms-is-not-engineering

--- end quote ---

spolu · 6 months ago
Yes, you are perfectly right. Our product pushes users to be selective about the tables they give a given agent access to for a given use case :+1:

The tricky part is correctly supporting multiple systems, each of which has its own specificities - all the way to Salesforce, which is an entirely different beast in terms of query language. We're working on it right now and will likely follow up with a blog post there :+1:
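
For a flavor of how different it is, here is a toy SOQL query (standard Salesforce objects; an illustrative sketch, not our implementation):

    # SOQL expresses parent/child relationships inline instead of with JOINs,
    # so SQL-generation habits don't port over directly.
    SOQL = """
    SELECT Name,
           (SELECT LastName, Email FROM Contacts)
    FROM Account
    WHERE AnnualRevenue > 1000000
    """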

fsndz · 6 months ago
Some people say agents can help, but it's still difficult to pull off without a robust human in the loop: https://medium.com/thoughts-on-machine-learning/build-a-text...
gcanyon · 6 months ago
My general method for things like this is to:

1. Get a .dot file of the database. Many tools will export this.

2. Open the .dot in a tool I built for the purpose.

3. Select the tables I'm interested in, and export a subset of the .dot file representing just those tables and relationships.

4. Hand that subset .dot file to the LLM and say, "given this schema, write a query -- here's what I want: <rest of the request here>"

That gets the job done 60% of the time. Sometimes, when there's an incorrect shortcut relationship resulting in the wrong join, I'll have to redirect with something like, "You need to go through <table list> to relate <table X> to <table Y>." That gets my success rate up above 95%. I'm not doing ridiculous queries, but I am doing recursive aggregations successfully.
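
As a rough sketch, steps 2-3 could also be done with networkx instead of my tool (file names and the table list are placeholders):

    import networkx as nx  # assumes networkx and pydot are installed

    # Load the schema graph, keep only the tables of interest, write it back.
    g = nx.nx_pydot.read_dot("schema.dot")
    tables = ["orders", "customers", "order_items"]
    subset = g.subgraph(tables)
    nx.nx_pydot.write_dot(subset, "subset.dot")

    # Step 4: paste the subset into the prompt.
    prompt = (
        "Given this schema, write a query -- here's what I want: ...\n\n"
        + open("subset.dot").read()
    )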

totalhack · 6 months ago
You are doing a lot of the work a semantic layer would do for you. I wonder if you would have better luck having the LLM talk to a semantic layer instead of directly to the database.
jt_b · 6 months ago
Can you talk more about what an implementation of such a semantic layer would look like?
Der_Einzige · 6 months ago
Combining this with structured/constrained generation via grammars/pydantic can supercharge it, btw - see the sketch below.
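
A minimal sketch of the pydantic side (the query schema below is hypothetical; assumes pydantic v2):

    from pydantic import BaseModel

    # Constrained generation forces the LLM's output to parse into exactly
    # this shape instead of free-form SQL/JSON.
    class MetricQuery(BaseModel):
        measures: list[str]
        dimensions: list[str] = []
        date_range: str | None = None

    # llm_output is whatever the model emitted; validation rejects anything
    # that doesn't fit the schema before it reaches the database.
    llm_output = '{"measures": ["Users.monthlyActiveCount"], "date_range": "last 6 months"}'
    query = MetricQuery.model_validate_json(llm_output)
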
gcanyon · 6 months ago
Can you illustrate this a little? Or should I be asking an LLM for advice? :-)
breadwinner · 6 months ago
Something like this is what I do also. Is there another way? How are others doing it?
dimitri-vs · 6 months ago
Export the DDL of one or more relevant tables, then run a query to get a random sample of 5 records from each table. It's really quick to gather, and it's enough context to handle most query-writing tasks with some guidance.
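
A sketch of that gathering step (SQLite here for brevity; the database path is a placeholder):

    import sqlite3

    con = sqlite3.connect("app.db")
    context = []
    for name, ddl in con.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall():
        # Random sample of 5 rows per table, appended under its DDL.
        rows = con.execute(
            f"SELECT * FROM {name} ORDER BY RANDOM() LIMIT 5"
        ).fetchall()
        context.append(f"{ddl}\n-- sample rows: {rows}")

    # "\n\n".join(context) then goes into the prompt alongside the request.
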
mritchie712 · 6 months ago
We (https://www.definite.app/) solved this a bit differently.

We spin up a data lake and pipelines (we support 500+ integrations / connectors) to populate the data lake for you, then put DuckDB on top as a single query engine to access all your data.
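
Roughly (a sketch; the file paths and columns are made up):

    import duckdb  # assumes the duckdb package is installed

    # One engine over the whole lake: DuckDB reads each connector's landing
    # zone directly, so a single SQL query can span sources.
    con = duckdb.connect()
    con.sql("""
        SELECT o.channel, count(*) AS conversions
        FROM read_parquet('lake/salesforce/opportunities/*.parquet') o
        JOIN read_parquet('lake/stripe/charges/*.parquet') c
          ON c.opportunity_id = o.id
        GROUP BY o.channel
    """).show()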

yonl · 6 months ago
This is really interesting. At my previous company, I built a data lakehouse for operational reporting with recency prioritization (query only recent data, archive the rest). While there was no LLM integration when I left, I've learned from former colleagues that they've since added a lightweight LLM layer on top (though I suspect Dust's implementation is more comprehensive).

Our main requirement was querying recent operational data across daily/weekly/monthly/quarterly timeframes. The data sources included OLTP binlogs, OLAP views, SFDC, and about 15 other marketing platforms. We implemented a data lake with our own query and archival layers. This approach worked well for queries like "conversion rate per channel this quarter" where we needed broad data coverage (all 17 integrations) but manageable depth (a reasonable number of rows scanned).

This architecture also enabled quick solutions for additional use cases, like on-the-fly SFDC data enrichment that our analytics team could handle independently. Later, I learned the team integrated LLMs as they began dumping OLAP views inside the datalake for different query types, and eventually replaced our original query layer with DuckDB.

I believe approaches like these (what I had done as an in-house solution and what Definite may be doing more extensively) are data- and query-pattern-focused first. While it might initially seem like overkill, this approach can withstand organizational complexity challenges - with LLMs serving primarily as an interpretation layer. From skimming the Dust blog, their approach is refreshing, though it seems their product was built primarily for LLM integration rather than focusing first on data management and scale. They likely have internal mechanisms to handle various use cases that weren't detailed in the blog.

nicholaschiang · 6 months ago
Do you host the data lake yourself? How do you handle data privacy?

Some large enterprises won't allow their data to leave e.g. their own Snowflake warehouse. All tooling they use has to run on top of their warehouse.

gavinray · 6 months ago
Interesting -- I work on a similar tool [1], and the JSON IR representation for your query is similar to the internal IR we used for our data connectors.

[1] https://promptql.hasura.io

lennythedev · 6 months ago
Just messed around with the concept last week. With a good enough schema explanation, bigger reasoning models did an amazing job. Definitely something I'm going to use in my dataflows.
mattpope · 6 months ago
I feel like this misses the most useful technique here (or really glosses over any techniques at all): flat, denormalized schemas. If you have an ENUM column, much like a human, the LLM won't have enough context to query it. And much like for a human, 4-way JOINs get tricky. Having one large, sparse, matrix-like table with many boolean columns like `is_sale_complete` and `is_sale_cancelled`, data-warehouse style, is straight up the key to effectively using an LLM to generate SQL.
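
A sketch of the shape I mean (made-up columns):

    # Flat, warehouse-style table: no joins to get wrong, and boolean flags
    # instead of opaque ENUM codes the model has no context for.
    DDL = """
    CREATE TABLE sales_wide (
        sale_id            BIGINT,
        customer_email     TEXT,
        product_name       TEXT,
        amount_usd         NUMERIC,
        is_sale_complete   BOOLEAN,
        is_sale_cancelled  BOOLEAN,
        is_sale_refunded   BOOLEAN
    )
    """
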
ilaksh · 6 months ago
On the off chance anyone is interested, I just built a Supabase plugin for MindRoot: https://github.com/runvnc/mr_supabase

MindRoot: https://github.com/runvnc/mindroot

Demo of agent using it to update a database after doing research: https://www.youtube.com/watch?v=nXwwSj0KSXI

If you give it the actual database password, it will use a Postgres client to read the actual information_schema and put it in the system message. It uses the normal Supabase library for everything else.
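
The introspection step is roughly this (a sketch, not the plugin's exact code; the DSN is a placeholder):

    import psycopg2  # assumes psycopg2 is installed

    # Pull column metadata from information_schema and render it as text
    # for the system message.
    con = psycopg2.connect("postgresql://user:password@host/db")
    cur = con.cursor()
    cur.execute("""
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
    """)
    schema_summary = "\n".join(f"{t}.{c}: {d}" for t, c, d in cur.fetchall())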

gergely · 6 months ago
How do you solve the following problems?

1. How do you ensure that running the same natural-language question always yields the same result?

2. How do you teach an agent/LLM the data model of the application that stores the data in the databases?
zurfer · 6 months ago
1. By having one obvious, documented way of answering the question, or by accepting that humans would also come to different results.

2. Investing in good data models and documenting the edge cases. Also, learning from explicit feedback is powerful.

This helps our customers at getdot.ai get highly reliable results.

The models have also gotten a lot better at making sense of weird and complex data models over the last few months.