bccdee · 3 days ago
My favourite way of doing this is

  with movieIds as (select id from Movie where title = $1),
       actorIds as (select Actor.id from Actor join ActorMovie on [...]
                    where ActorMovie.Movie in (select id from movieIds)),
       alsoActedIn as (select id from ActorMovie
                       where actor in (select id from actorIds)),
       movieResults as (select * from Movie
                        where id in (select id from movieIds)),
       actorResults as (select * from Actor
                        where id in (select id from actorIds)),
       alsoActedInResults as (select * from Movie join ActorMovie on [...]
                              where ActorMovie.id in (select id from alsoActedIn))
  select * from movieResults
      full outer join actorResults on false
      full outer join alsoActedInResults on false;
Not every database supports "full outer join on false," and sometimes you have to add "as not materialized" to the "with" subqueries in order for it to be performant, but it works in Postgres, and you end up with a results table that looks something like this:

  MovieId,MovieTitle,ActorId,ActorName,MovieId,MovieTitle,ActorMovie.actor
  0,"Indiana Jones",null,null,null,null,null
  null,null,0,"Harrison Ford",null,null,null
  null,null,null,null,0,"Indiana Jones",0
  null,null,null,null,1,"Star Wars",0
  null,null,null,null,2,"The Fugitive",0
With a bit of client-side logic, it's easy to transform that into this:

  [{
    id: 0,
    title: "Indiana Jones",
    actors: [{
      id: 0,
      name: "Harrison Ford",
      alsoActedIn: [
        {id: 1, title: "Star Wars"},
        {id: 2, title: "The Fugitive"},
      ]
    }]
  }]

without the need to use json as a middleman.

This pattern has saved me from some truly awful query logic.

timeinput · 3 days ago
I've been trying that, and I just keep running into miserable query performance.

What order of magnitude do you use this on and find it acceptable? 1MB, 1GB, 1TB, 1PB? At 1GB it seems okay. At 10GB things aren't great, at 100GB things are pretty grim, and at 1TB I have to denormalize the database or it's all broken (practically).

I'm not a database expert, and I don't feel like I'm asking hard questions, but I'm running into trouble with something I thought was 'easy' and that matches what you're describing.

bccdee · 3 days ago
Try using the `explain analyze` statement. My guess is that either you need to add indexes, or you're not using the indexes you do have because some of your subqueries are being materialized when they shouldn't be. I can't be sure of the specific problem, but you should be able to fix it with some run-of-the-mill query optimization.
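For example, against the movie query upthread (a minimal sketch; `not materialized` needs Postgres 12+):

  explain analyze
  with movieIds as not materialized (
    select id from Movie where title = 'Indiana Jones'
  )
  select * from Movie where id in (select id from movieIds);

If the plan shows a sequential scan on a big table where you expected an index scan, that's usually where the time is going.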
kaelwd · a day ago
My favourite way of doing this is with edgeql

    select Movie {
      title,
      actors: {
        name,
        alsoActedIn := (
          .movies { title }
          filter .id != Movie.id
        ),
      },
    } filter .title = <str>$title;

gerad · 3 days ago
couldn't you do a union all instead of outer join on false?
bccdee · 3 days ago
Nope, because unions merge columns. `select * from Actor union select * from Movie` gives you [(0, "Harrison Ford"), (0, "Indiana Jones")], which is a problem because we can't tell which rows are actors and which rows are movies. What we need is [(0, "Harrison Ford", null, null), (null, null, 0, "Indiana Jones")].
lixtra · 3 days ago
Yes, you could use union. But then you have to pad the columns of the other tables with NULLs to arrive at the same output and carefully count. And we all hate counting.
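Something like this (a sketch; the casts keep the padded columns typed):

  select Movie.id, Movie.title, null::int as actorId, null::text as actorName
    from Movie
  union all
  select null, null, Actor.id, Actor.name
    from Actor;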
taffer · 3 days ago
> With a bit of client-side logic, it's easy to transform that into this [example here] without the need to use json as a middleman.

The result in your example looks exactly like JSON. Am I missing something?

Also, what is the point of avoiding JSON? Your client has to unmarshal the result either way.

bccdee · 3 days ago
The json in my example would be generated in client code from the rows returned by the query, which I represented as CSV. The client's representation could just as easily be objects or protobufs or whatever, but I figured json would be a convenient way to portray nested data.

It's worth avoiding json on the wire if possible, because the extra layer of encoding will complicate things. Any good postgres integration will know how to safely deserialize and discriminate between (e.g.) DateTime and DateTimeWithTimeZone. But with json, everything's stringified.
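Compare (Postgres):

  select now();           -- arrives as a typed timestamptz the driver can decode
  select to_json(now());  -- arrives as a quoted string the client must re-parse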

greggyb · 4 days ago
Coming from a data engineering and business analytics background, the terms "structured" and "unstructured" seem to be swapped in meaning here.

I'd expect to talk about anything in a DB as "structured data". Flexible serialization formats such as JSON or XML are "semi-structured". And something from e.g., an office document is "unstructured".

This is not a complaint or criticism. I understand the article just fine. It's just interesting how different perspectives can affect what words mean (:

sinfulprogeny · 3 days ago
Kleppmann[1] calls it schema-on-read (json, xml) and schema-on-write (typed columns in an RDB). I like it over structured/unstructured, it's a bit more specific.

[1] https://martin.kleppmann.com/2017/03/27/designing-data-inten...

greggyb · 3 days ago
Yes, I agree with that preference. I don't love the verbiage of "structured" / "unstructured" in either usage (the article's or that which I shared).
thaumasiotes · 3 days ago
> Coming from a data engineering and business analytics background, the terms "structured" and "unstructured" seem to be swapped in meaning here.

Mark Rosewater likes to write about his personal concept of "linear" Magic: the Gathering decks, which are decks in which the cards tend to pick up synergy bonuses from each other, so that having two of them together is considerably more powerful than you'd expect from the strength of those two cards individually.

This always bothers me because it is the opposite of the more normal use of "linear" relationships, in which everything contributes exactly as much to an aggregate as it's worth individually.

stevage · 3 days ago
I think the SQL sense is more that "structured" means "it contains internal structure" (like a tree), whereas a table without JSON is free from additional structure apart from its own table structure.
stared · 3 days ago
I used to work a bit with MongoDB (over 10 years ago) and it did wonders in making me fall back in love with relational databases. I finally saw the light of Codd!

SQL, as a language, is clunky, true. It can be patched here and there, either by PipeSQL or by various ORMs. I agree that it would be wonderful to have standardized tooling for generating JSON like in the post.

Yet with relational databases you can separate the concerns of what your data is and how you want to display it. If you use a JSON-like way to store data, it can do the job until you want to change the data or the queries.

andyferris · 3 days ago
100%.

The shame, to me, is that SQL is _unnecessarily_ clunky with producing query results with nested/hierarchical data. The relational model allows for any given value (field or cell value) to be itself a relation, but SQL doesn't make it easy to express such a query or return such a value from the database (as Jamie says - often the API server has to "perform the join again" to put it in nested form, due to this limitation).

bazoom42 · 3 days ago
> The relational model allows for any given value (field or cell value) to be itself a relation

First normal form explicitly forbids nested relations though. Relational algebra does not support nested relations for this reason.

But perhaps nesting relations might make sense as the final step, just like sorting, which is not supported by the pure relational model either.

strbean · 3 days ago
SQL is unnecessarily clunky in just about every respect. It's mind boggling that the syntax of a language designed in the 1970s has so many ardent defenders.
zozbot234 · 3 days ago
> SQL is _unnecessarily_ clunky with producing query results with nested/heirarchical data.

Property Graph Query (PGQ) is now a part of SQL and is expected to help with expressing these complex queries.

mycall · 2 days ago
There are lots of ways in SQL to provide nested/hierarchical data, each with its own strengths and weaknesses.

* Adjacency List Model

* Path Enumeration Model, also known as the Materialized Path

* Closure Table Model (or Bridge Table)

* Nested Set Model

* Recursive CTEs
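For example, a recursive CTE over a hypothetical adjacency-list table `category(id, parent_id, name)`:

  with recursive subtree as (
    -- anchor: the row we start from
    select id, parent_id, name, 0 as depth
      from category where id = 1
    union all
    -- recursive step: pull in each row's direct children
    select c.id, c.parent_id, c.name, s.depth + 1
      from category c
      join subtree s on c.parent_id = s.id
  )
  select * from subtree;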

arnsholt · 3 days ago
My stint with MongoDB was brief, but I too came away with a deeper appreciation of SQL databases. I feel it's a bit like a good type system: yes, there absolutely is an upfront cost, but over time it really does save you from stupid runtime problems. It's especially important when the bugs are in the data storage layer, because if a bug causes bad data to be written to your database, not only do you need to find and fix the bug, you also have to figure out how to deal with (possibly lots of) bad data.

Also, modern SQL is an incredibly powerful language. Good SQL can save you from lots of dumb data munging code if you know how to wield your database properly. Especially for analytical queries, but also more typical application code IMO.

anonymars · 3 days ago
I have spent many years beating the drum that a few minutes spent on constraints today save many, many, many hours of painful data cleanup later

And in fact, good constraints also can improve query performance. If the optimizer knows this column is unique, or that column is guaranteed to have a corresponding value in the joined table, it can do all sorts of tricks...
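For instance, the join table from upthread with its guarantees spelled out (hypothetical DDL):

  create table actor_movie (
    actor_id int not null references actor(id),
    movie_id int not null references movie(id),
    primary key (actor_id, movie_id)  -- a given pair appears at most once
  );

The primary key and foreign keys give the planner facts it can exploit for better cardinality estimates and, in some engines, join elimination.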

da_chicken · 3 days ago
Yes, nothing will make you appreciate the relational model more than using a database where the designer broke first normal form. Or where the developer thought the data in the database was theirs and it was only ever going to be used by just their application.

Something else I find confusing is that every developer seems to want every database query for a single object to return a single row with a thousand columns, and anything multi row or with multiple results is way too complicated to handle. This goes double for report writing software.

I really wonder what we're doing with database providers (drivers) that makes people want to develop ORMs when ORMs consistently feel like 10 ton gorillas. If the database world is so disparaging of the "row by excruciating row" processing, why do the drivers exclusively present data that way and no other?

hdjrudni · 3 days ago
I can't figure out why when I do `t1 left join t2` and it's not 1:1 or 1:0, I get duplicate t1 rows. Why can't either the driver or SQL find a way to represent a single t1 with multiple t2 (this is of course trivial to represent in JSON or any programming language with arrays and dicts).

I don't mind that the tables in relational databases are essentially 2d matrices. I can do the hierarchy with the joins. I just need a way to get the data out in a sane format.

Last time I brought this up ~15 years ago everyone hated on me. But everyone knows this is ultimately the data we want to get out of the system.
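The closest I've found in plain SQL is aggregating the join (column names hypothetical):

  select t1.id, array_agg(t2.val) as t2_vals
    from t1
    left join t2 on t2.t1_id = t1.id
   group by t1.id;

but that's still an array squeezed into a cell, not a first-class nested result.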

mrlongroots · 3 days ago
Agree: flat/relational structure is highly efficient to store, query planning/optimization/scaling become a lot easier, the language is clunky but relational algebra is beautiful, but...

I think pushing multiple joins to what is still a relational database and getting a complex output isn't the worst idea in the world, as a higher-level layer on top of a regular database.

On the other hand, "it needs four queries/RTTs" is not the worst thing in the world. It needn't be the goal of a system to achieve theoretical minimum performance for everything.

Let those who truly have the problem in prod push the first patch.

andyferris · 3 days ago
Yes - I have to agree.

Codd was right in that if you want transactional semantics that are both quick and flexible, you'll need to _store_ your data in normalized relations. The system of record is unwieldy otherwise.

The article is right that this idea was taken too far - queries do not need to be restricted to flat relations. In fact the application, for any given view, loves hierarchical organization. It's my opinion that application views have more in common with analytics (OLAP) except perhaps latency requirements - they need internally consistent snapshots (and ideally the corresponding trx id) but it's the "command" in CQRS that demands the normalized OLTP database (and so long as the view can pass along the trx id as a kind of "lease" version for any causally connected user command, as in git push --force-with-lease, the two together work quite well).

The issue is of course that SQL eschews hierarchical data even in ephemeral queries. It's really unfortunate that we generate jsonb aggregates to do this instead of first-class nested relations a la Dee [1] / "third manifesto" [2]. Jamie Brandon has clearly been thinking about this a long time and I generally find myself nodding along with the conclusions, but IMO the issue is that SQL poorly expresses nested relations and this has been the root cause of object-relational impedance since (AFAICT) before either of us were born.

[1] https://github.com/ggaughan/dee [2] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

9rx · 3 days ago
> The article is right that this idea was taken too far

The biggest mistake was thinking we could simply slap a network on top of SQL and call it a day. SQL was originally intended to run locally. You don't need fancy structures so much when the engine is beside you, where latency is low, as you can fire off hundreds of queries without thinking about it, which is how SQL was intended to be used. It is not like, when executed on the same machine, the database engine is going to be able to turn the 'flat' data into 'rich' structures any faster than you can, so there is no real benefit to it being a part of SQL itself.

But do that same thing over the network and you're quickly in for a world of hurt.

Mikhail_Edoshin · 3 days ago
But the very title of Codd's paper mentions shared data banks, doesn't it? Concurrent access, and thus networking, was there from the beginning. One of the reasons SQL is the way it is (declarative) is that it shields the user from the underlying concurrency: there is no notion of it at all; to the user, things look as if he were the only client of the database, while in reality he's just one of many.
reaanb2 · 3 days ago
In my view, the O/R impedance mismatch derives from a number of shortcomings. Many developers view entities as containers of their attributes and involved only in binary relationships, rather than the subjects of n-ary facts. They map directly from a conceptual model to a physical model, bypassing logical modeling. They view OOP as a data modeling system, and reinvent network data model databases and navigational code on top of SQL.
mcphage · 3 days ago
In that case, the mismatch is between "What developers need" and "What SQL provides".
kragen · 3 days ago
I agree about first-class nested relations, but I don't agree about transactions.

Codd was writing 10 years before the idea of transactional semantics was formulated, and transactions are in fact to a great extent a real alternative to normalization. Codd was working to make inconsistent states unrepresentable in the database, but transactions make it a viable alternative to merely avoid committing inconsistent states. And I'm not sure what you mean by "quick", but anything you could do 35 years ago in 10 milliseconds is something you can do today in 100 microseconds.

andyferris · 3 days ago
It's not about just _transactions_. What you wrote is 100% correct.

It's specifically about _fast_ transactions in the OLTP context. When talking about the 1970s (not 1990s) and tape drives, rewriting a whole nested dataset to apply what we'd call a "small patch" nowadays wasn't a 10 millisecond job - it could feasibly take tens of seconds or minutes or hours. That a small patch to the dataset can happen almost instantly - propagated to its containing relation and a handful of subordinate index relations - was the real advance in OLTP DBs. (Of course this never has and never will help with "large patches" where the dataset is mostly rewritten, and this logic doesn't apply to the field of analytics.)

Perhaps Codd "lucked out" here or perhaps he didn't have the modern words to describe his goal, but nonetheless I think this is why we still use flat relations as our systems of record. Analytical/OLAP systems do vary a lot more!

mrkeen · 3 days ago
The author didn't see fit to mention that you just flip the arrows.

Classes/Structs know about their children.

Relations know about their parents. If you want crazier m*n relationships you use an association table.
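A sketch (hypothetical schema):

  -- A class would say: Movie { reviews: List<Review> }.
  -- The relation flips the arrow: the child row points at its parent.
  create table movie (id int primary key, title text);
  create table review (
    id int primary key,
    movie_id int not null references movie(id),  -- child -> parent
    body text
  );
  -- m:n: an association table whose rows point at both parents.
  create table actor (id int primary key, name text);
  create table actor_movie (
    actor_id int references actor(id),
    movie_id int references movie(id)
  );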

Did the author just not know? Or did he not see it as worthy of dismissal?

> Doing this transformation by hand is tedious and error-prone. We call this tedium "the object-relational mismatch" but it isn't really about objects or relations.

It really is.

> The fundamental problem is that fitting complex relationships to human vision usually requires constructing some visual hierarchy, but different tasks require different hierarchies.

Yes. Different tasks require different hierarchies. One particular way of doing things should not be baked into your 1970s relational model.

tucnak · 4 days ago
> All that's left to do now is... the same joins, but inside the backend web server. Because we have to re-assemble these flat outputs into the structure of the page.

This is NOT the case with modern SQL, as it supports JSON. The typical Postgres use-case would be to produce denormalised views for the specific relationships and/or aggregates in JSON form. LATERAL, the correlated sub-query, is really convenient for avoiding double-counting. The rest of the article deals with object-mapping, which is really a poor man's VIEW. Time and time again, software engineers reinvent the data-access wheel out of ignorance of database capabilities.
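E.g. against the Movie/Actor/ActorMovie schema from upthread (a Postgres sketch):

  select m.id, m.title, a.actors
    from Movie m
    cross join lateral (
      select json_agg(json_build_object('id', act.id, 'name', act.name)) as actors
        from Actor act
        join ActorMovie am on am.actor = act.id
       where am.movie = m.id
    ) a
  where m.title = $1;

The lateral subquery runs once per movie, so each actor is aggregated exactly once, with no double-counting from fanned-out joins.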

taffer · 4 days ago
This is also the conclusion of the article. The author then shows an example of how to get hierarchical json out of your relational database.
andyferris · 3 days ago
JSON is very helpful here, and it _happens_ [1] that our HTTP APIs tend to respond with JSON, but I feel SQL's data model is much better suited. The difference is that JSON is hierarchical but SQL relations are flat. However if we let a relation be contained as a value (in a field/cell) then we should be able to produce our hierarchical query result in a natural way (and possibly map that through a `toJSON` method on the API server, if the client wants application/json... otherwise the server might want to do grpc or whatever and getting JSON from the DB is a roundabout pain in the bum).

[1] Actually causality is backwards here - postgres supports JSON only because the RESTful API servers that frequently interact with it need to provide JSON to _their_ clients... and so it was a highly demanded feature.

MaxMonteil · 4 days ago
This sounds very interesting, I've been using SQL more lately and am constantly impressed by what it can just do.

My thinking is that everything I can get done in the DB avoids heavier and maybe slower application code.

Do you have some resources or material I could check to learn more?

taffer · 3 days ago
Putting business logic in the database: https://sive.rs/pg

Example code: https://github.com/sivers/store

This talk is more Oracle specific but with some minor changes you can apply the same ideas to postgres: https://www.youtube.com/watch?v=GZUgey3hwyI

Personally I am using an architecture similar to https://sive.rs/pg in my personal projects and I am very happy with it. It is important that you put data and procedures/views in different schemas so that you can use migrations for your data but automatically delete and recreate the procedures/views during deployment. Also use pgtap or something similar for testing.
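Roughly, the split looks like this (hypothetical names):

  create schema data;  -- tables: evolved via migrations
  create schema api;   -- views/procedures: dropped and recreated on deploy

  create table data.movie (id int primary key, title text);

  create view api.movie as
    select id, title from data.movie;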

fredguth · 4 days ago
Today I Learned the power of LATERALs... thanks. :-)
bob1029 · 3 days ago
> Time and time again, software engineers reinvent the data-access wheel out of ignorance of database capabilities.

Much of this discourse around SQL "not having structure" seems to be about arrogance rather than ignorance. It would take 10 seconds with ChatGPT to resolve this deficit, but for some reason we insist on writing entire blogs and burning hundreds of aggregate hours commenting about a make-believe world wherein views and CTEs don't exist.

kragen · 3 days ago
An interesting thing about "A Relational Model of Data for Large Shared Data Banks" (the stone tablets handed down from Codd https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) is that it starts out by considering N-ary relations whose values may themselves be relations, and only then switches to considering "normalized" relations whose values are not relations:

> A relation [table] whose domains [column types] are all simple [not relations] can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.

But non-normalized relations support the kind of nested structure the eminent Dr. Brandon wants, without resorting to JSON or abandoning the strong uniform typing we have with SQL. Darwen & Date's The Third Manifesto https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf includes group and ungroup operations which translate back and forth between normalized and non-normalized relations.

I've been playing around with some of these ideas using some rather horrific atrocities perpetrated against Lua operator overloading in http://canonical.org/~kragen/sw/dev3/cripes.lua. I haven't added joins or ungrouping to it yet, but I think it might be a prototype of something promising for this kind of use case.

cryptonector · 3 days ago
^F case-insensitive search for:

  - cte
  - recursive
  - with (in SQL code, not prose)
  - connect (as in Oracle's CONNECT BY)
nothing. Nothing!

SQL deals with hierarchical data just fine. You just have to learn how (recursive CTEs).

bwestergard · 8 hours ago
I've recently been learning about Datalog, which is excellent for all the queries that you'd want to do with recursive CTEs, but has the added benefit of making it much easier to factor one's code into discrete units (relations) for reusability.