I think the problem with this essay is that it's overly technical: only those well enough versed in SQL will really care to read the whole thing. And anyone already at that level has either accepted that "SQL will get the job done in the end", or has learned to live with it and now even kinda embraces it, and is happy to write about how the examples are very poor and dismiss the critique on that basis, even though the essay explains its main point pretty well:
>> The core message [...] is that there is potentially a huge amount of value to be unlocked by replacing SQL
To me, a lot of people defend SQL by saying that "perfect is the enemy of good" and that SQL simply works. Not anyone's favourite, but everyone kinda accepts it.
And yeah, it's true. People use SQL because it's good enough, and trying to reinvent the wheel would take more work (individually speaking) than just dealing with SQL as it is right now. For large organizations where the effort could be justified, all your engineers already know SQL anyway, so it's not so great either.
But for something as important as relational databases, perfect is not the enemy of good. We do deserve better. We generally agree that SQL has many pitfalls and that it's not great for any kind of user: for non-technical users, a visual programming language would work well here, more like what Airtable does, bridging the gap between spreadsheet and hardcore database; and for technical users, it does feel unwieldy and quirky. We should be more open to at least considering critiques and proposals for something better. We might find out that people, from time to time, are making some good points.
I think I'm experienced enough to understand the article, and I agree. I've written multiple optimizing SQL generators (altering generated SQL to access better plans), and rewritten hundreds of queries for better performance, which involves trying many semantically identical rewrites of the same query.
I agree with Jamie. I think SQL is irritatingly non-composable, many operations require gymnastics to express, and I'd like a more expressive and regular language to write queries in.
I also suspect that such a language would be harder to optimize, and might be more practical to implement if it was less declarative and closer to the plan.
I also think that as you scale up, offloading computation to the database is a false economy. It's closer to the data, but the database itself is a bottleneck. As you scale, you want to limit the complexity of your queries. But this is actually an argument for things like Materialize, i.e. making copies of the data elsewhere that bake in some of the computation ahead of time.
> I think SQL is irritatingly non-composable, many operations require gymnastics to express
One approach to radically simplify operations with data is to use mathematical functions (in addition to mathematical sets), which is implemented in the Prosto data processing toolkit [0] and the (new) Column-SQL [1].
[0] https://github.com/asavinov/prosto Prosto is a data processing toolkit - an alternative to map-reduce and join-groupby
[1] https://prosto.readthedocs.io/en/latest/text/column-sql.html Column-SQL (work in progress)
On paper, a more expressive language that spits out SQL queries sounds great, but I've never seen a single one not become a pain in the ass to use.
https://logica.dev/
I always thought that was what Hadoop had in mind with HDFS, i.e. move the storage and query engines closer in terms of spatial and temporal locality, albeit only if you partitioned/clustered properly.
Speaking of which, what has happened to Hadoop and HDFS? It used to be the flavor of the month about 10 years ago, but now I hardly ever hear people talk about it.
SQL is exactly like the QWERTY layout: a first quickshot with little design thought and unfixable architectural issues that's so widespread that everyone is used to it by now.
Trying to change to the Dvorak layout taught me a lot about enacting change on such a grand scale.
After a lot of hassle switching machines and OSes, typing on other users' computers and them typing on mine, and general headaches around internationalization and inaccessible key combinations, I switched back to QWERTY in the end.
As said in other threads, there is no shortage of attempts to replace SQL. A lot of them are pretty good. But having learned SQL the hard way, I feel zero urge to learn another language from scratch right now.
It's why nearly all big databases eventually switched to it despite the FORTRAN vibe and its general ugliness.
Anyway, probably time to SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED and call it a day ^^
While I agree with your point, I wonder why that does not seem to be the case with programming languages.
For example, in iOS development (and, more in general, on Apple platforms), there has been a huge shift from Objective-C to Swift.
The same arguments should apply there. Swift is much better, but Objective-C got the work done, and many codebases were written in it, especially at Apple. And yet, the whole community switched pretty quickly.
One could argue that Swift was easier to pick up for newcomers. While that's true, I would then expect the same argument to apply to SQL alternatives.
So, what is the difference here?
You're right that it got established early and benefited from network effects, but Codd had two goes at a relational query language before Chamberlin and Boyce had theirs. So not literally "a first quickshot".
I think the fact that many people were willing to try switching to mongodb despite the fact that it doesn’t use sql and had quite poor semantics should suggest that there would be some way for a better relational database query language (and engine) to catch on.
On the other hand, the fact that many new time series databases (and other engines, including Materialize which the author of the article worked on) wanted sql (or fake sql-like languages when they couldn’t manage sql) suggests that not being sql can be a big hindrance.
I don’t know. If someone gives me an interface like Prisma, and the backing database does not use SQL, but is otherwise the same, that just removes one needless translation step, and I wouldn’t have to do anything.
I find the fact that others cannot borrow my computer to be a major plus of using non-QWERTY layouts. Your point about keyboard shortcuts is spot-on: forget hjkl navigation in vim if you're not on QWERTY.
Off-topic, but changing layouts is easy. Also, using other people's keyboards was not the best for hygiene even before covid.
I agree. I've seen people bashing SQL ever since I started using it 25 years ago, but I've yet to see an attempt to replace or supersede it succeed.
Stop fighting SQL so much, and just focus on bringing a better solution. If potential users see it has significant benefits they'll start using it.
Many of these blog articles seem to be written by devs who have only really experienced the user-facing application side of things, and don't realise the sheer number of financial, analytic & business systems that keep the world spinning round, all happily buzzing away using SQL.
Moaning about JSON manipulation in SQL is madness. A client should use appropriate joins to get the data from the database & then do the transformation. Storing JSON when you actually need to perform operations on said data makes 0 sense. It’s a huge overhead, an utter nightmare to index etc.
In theory datalog is better than SQL. However, the reality is that we have millions of programmer-hours invested in making SQL perform well, where datalog has nowhere near this effort.
I think that's why NoSQL is where the gains are being made. You can't magically improve things with a query language without putting the effort into performance, but you can get huge gains by changing your assumptions about how data is structured.
Inventing a new language and a new query engine and a new storage engine at the same time, competitive with the state of the art, is maybe just too much to be feasible.
Unfortunately, building a new database is a huge project and there appears to be no party currently willing to sponsor it.
There were multiple optimizing compilers which could do a lot of these asks, allowing composable queries that return nested types while still producing SQL queries. I think the Pathfinder compiler had the most real-world use; it was meant to efficiently mix SQL and XQuery to query SQL+XML documents stored in Postgres. It had a C# LINQ frontend, but also fairly hefty compile times.
I think most of the criticism is about SQL not being the best programming language - a language in which pros build applications, especially web applications. And SQL is truly not the best at that. It is not the best API.
But in my world, SQL is much more of a Human Data Interface than Application Programming Interface. SQL is used for both ad-hoc and automated data querying and transformation. It is something manufacturing intelligence and supply chain experts and researchers and others learn to empower themselves to get to the data - and yes, you won't see any CEO running reports in SQL themselves, but it is not for programmers only.
Those people would not benefit from syntax with more brackets instead of words, easier way to define and call functions, or the ability to install libraries - in fact I think it would make it harder and less accessible for them.
The OP is right that for machine-to-machine interface, all the syntax baroqueness is not worth it. And of course, having more portability would be great.
But while machine-to-machine interfaces written by skilled developers who know exactly what they are doing might actually be the most common use of SQL (since half the internet runs on WordPress and MySQL, and most smartphone apps might have SQLite in them), that is not where the majority of complexity or time is spent - that lies with the people working with data.
Replacing SQL is a tough sell because at its core, it really does work pretty well for what it was designed to do. (i.e. not most of the things they've bolted onto it and the corresponding databases this century) But it's not a silver bullet and people should stop trying to jam things into relational databases that aren't a good fit. There are already a number of potentially useful alternatives out there depending on your particular problem.
In the event that nothing exists that is right for the job: don't worry about replacing SQL, worry about creating/using a matched DSL/API/(whatever is appropriate) and data store for the particular problem you're trying to solve that SQL isn't a good fit for. If you create something that solves its particular problem even half as well as SQL did for relational databases you'll probably have a winner in that space that people with the same problems would be happier using.
To me, SQL looks like something I should be using 79-char punchcards for.
Scalable databases are just so difficult that we’re still driving a ‘64 IMPALA
Most of this opinion comes from “SQL” being vendor-specific. Is JSON vendor-specific? Is anything else, that we actually use by choice?
Mad at you too, Graph DBs, for sending us on another snipe hunt by adding vendor-imposed innovations, because it makes the enterprise marginally profitable. It’s how the world works, I’m still not happy about it.
(Disclaimer: I possibly just traveled back to 2002 and said this on slashdot)
> Most of this opinion comes from “SQL” being vendor-specific. Is JSON vendor-specific? Is anything else, that we actually use by choice?
Two things that come to mind are Markdown with all its flavors and Regex with multiple engines.
edit: and to a lesser extent maybe C/C++ compilers and JS engines.
edit2: also JVM, Python and Ruby runtimes
But both edits describe technologies with an official spec and slightly different implementations. Markdown/Regex are more comparable to SQL because they have vendor-specific syntax.
You're mixing up different things that don't necessarily need to be addressed at the same time. SQL is the declarative programming language used to express what data you need.
What type of database engine is used to execute that query is independent of the language. It could be highly scalable or it could be focused on single user, single process.
What is the problem that needs solving?
Totally! SQL is like shell, C and JS: somehow they all generate this attitude of "never ever try to improve them for real, let's stay suffering for all eternity!"
And the amount of energy spent masking them (transpilers, linters, "good practices", tricks, etc.) is a monumental waste of effort that costs far more than actually going in and fixing the core issues.
And in this case, SQL is the most trivial to fix. It could be simplified and streamlined far more easily than C could be replaced, for example.
And a lot of interactions with DBs already go through bridges. Creating a "good SQL" would be similar to creating WASM: keep supporting all the old SQL in end-of-life fashion for legacy reasons, and point everything new at the new way.
But this means at least two or three major database vendors would have to go for it (I dream: SQLite & PostgreSQL).
P.S.: I'm exploring building a relational language, so I already have a simplified dialect for it: https://tablam.org/tutorial
Well, there were a lot of attempts to replace SQL. And there will be more. But until then, SQL gets the job done®. By the way, why "replace"? Let people choose. If you have a better tool for the job, pick that over SQL.
There are probably a few of us who simply disagree. The fact that there aren’t libraries for SQL is a huge advantage in my book, and I say that as someone who’s overseen his developers move more and more from C# to Python because it has libraries that are simply easier to work with.
In this you might be able to guess my objection to many of OP's points. Because the key thing we look for when we decide to adopt tech isn't how good it is, but how easy it is to work with, and more importantly, how easy it is to find other people who can pick it up when someone's career takes them forward in life.
SQL is the one technology that has been stable enough over the years that no matter who we hire, we can always count on them being capable of working with our databases. Yes, some of the more expansive functions, especially some of the recursive ones, take a little time to get accustomed to, and there will always be some back-and-forth preferences in regards to stored procedures, but on the whole, it's the one technology that never causes us any pain and never requires retraining.
I'm sure it's not efficient, and OP is completely correct about the untapped potential, and I fully support people developing rational alternatives to SQL; I'm simply pointing out why SQL has been the workhorse it has been since, well, basically since we moved beyond punch cards, and why it'll likely still be so in 50 years.
Because at the end of the day, technology doesn't really matter to the business. Cost vs benefit does, and as long as it's easier and therefore cheaper to have people work with SQL (in the big picture), it'll be what businesses work with.
Why you don't see something similar in declarative programming (or even functional) is really beyond me, but it probably has to do with how things evolved and with how much easier it is to change that part of the tech stack compared to your databases. If we get a new hire who can do magic with JavaScript/Python/whatever, we may consider allowing him/her to do that. We don't want to, again because the big picture is easier to manage with fewer technologies, but it's much easier to justify the cost-benefit of that compared to someone wanting to use a different database than the one in our managed cluster. Already you need more staff than that person to maintain it, as developers don't run database server operations, aren't part of the contingency staff, and so on.
Like I said, I fully encourage the direction OP wants to go. Enterprise rarely invent things, we’d be happy with ball frames if that’s what made us the most money, so we need people like OP to invent things for us to adopt.
I completely agree with you. Is SQL perfect? No. Have I accepted and embraced it? Yes, because it'll get the job done.
I also happen to really dislike that the author hasn't capitalized syntax like SELECT FROM WHERE or CREATE TABLE, which, to me, hurts the legibility and therefore makes me less interested in reading the argument overall.
Among the various languages I use, why is SQL the only one that favors ALL CAPS EVERYTHING? When writing ad hoc queries I ignore that convention just to be ornery.
I wonder how many of the limitations are necessary in order for the query optimizer to have any chance at finding a good execution plan. As you add more and more abstraction and more general computation in the middle of your queries, it will probably become harder and harder for the query optimizer to understand what you are actually trying to do and figure out how to do it efficiently. Aren't you running the risk that the database will have to blindly scan through entire tables, calling your user-defined functions as black boxes on each row?
I would also guess that we could have a better SQL but I do not think it could and should look anything like a general purpose programming language because otherwise you might get in the way of efficient query execution. Maybe some kind of declarative data flow description with sufficient facilities to name and reuse bits and pieces.
And maybe you actually want two languages which SQL with its procedural parts already kind of has. One limited but well optimizable language to actually access the data, one more general language to perform additional processing without having to leave the server. Maybe the real problem of SQL lies mostly in its procedural part and how it interfaces and interacts with the query part.
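To make the black-box worry above concrete, compare these two filters (table and function names made up): the first predicate is something a planner can match against an index, while the second hides the logic inside a UDF, leaving the database little option but a full scan that calls the function on every row.

    -- sargable: the planner can use an index on created_at
    SELECT * FROM orders
    WHERE created_at >= DATE '2021-01-01';

    -- opaque: my_fiscal_year() is a black box to the optimizer,
    -- so every row gets scanned and the UDF is called per row
    SELECT * FROM orders
    WHERE my_fiscal_year(created_at) = 2021;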
> I wonder how many of the limitations are necessary in order for the query optimizer to have any chance at finding a good execution plan.
Probably in exactly the opposite way: the limitations of SQL put a lot of work on the back of the query optimiser without allowing for said optimiser to easily reason about the queries, or for the writer to easily feed the optimiser (short of dedicated extensions e.g. Oracle's optimizer hints).
Or so I would think, I've never heard of SQL being praised for its optimisability.
I think it's important to distinguish between what's highly optimizable in theory and what's easily optimizable in practice - the latter works here and now, while the former may be a (possibly perpetual) decade of compiler development away.
An example here is how, sure, in theory, JITs can outpace AOT compilation because they have all the information the AOT compiler has plus runtime insights. But the ability to truly do that always seems to be a decade of compiler development away, with many giving up on the idea entirely.
It's also important to consider what we're comparing SQL's optimizability against. If it's against typical NoSQL databases, most of which seem to favour a lower-level query specification, I can defend SQL's optimizability to the end - with SQL databases having the freedom to dynamically switch algorithms to adapt to the actual contents of the database at the time of the query. Something which, ironically, a stale optimizer hint (i.e. written with the size & shape of the database as it was last year in mind) can actually get in the way of. Not that I'm saying that SQL planners never produce stupid query plans of course.
Optimizability is one of the core ideas of the relational model. The others being data independence and a declarative query language. SQL is based on relational calculus and relational algebra, which in turn are based on first-order logic. The reason that every attribute in the relational model must be atomic is that first-order logic cannot be applied to nested sets. However, first-order logic is what makes query optimization possible in the first place. Thus, if one were to develop a query language that did not have these constraints, then one would lose optimizability.
I think the criticisms of the article are basically right. SQL sucks in a lot of ways, and the base SQL standard really sucks such that virtually everyone has extended it at least somewhat, but they've all done it in a completely nonstandard way so nothing is portable, and the standard is never officially updated anymore. Oh also it's a massively leaky abstraction and you may have to tune your query to the database anyway, like even just normally, not even something ported, hence hints/etc.
There are quite a lot of pretty basic things many programmers would want to do that are just way more stupid than they should be. One that irks me is things like "return me the top 100 items" or "return me / join against the most recent event for item X in a history log", which end up requiring way more shit than they should just because there's no standards-compliant way to say "select first X rows from ... where ... order by ..." or "join first row where ... order by ...". In the case of top 100 you just wrap it in another query, but for more analytical stuff like "join the most recent X for this item" you have to use window functions, and the syntax fucking sucks.
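For the record, here is roughly what those two cases look like in standard SQL (table and column names made up): top-N finally got a standard spelling, while "latest event per item" still goes through a window function and an extra nesting level.

    -- top 100: standardized late (FETCH FIRST), after years of vendor LIMIT/TOP/ROWNUM
    SELECT * FROM items ORDER BY score DESC FETCH FIRST 100 ROWS ONLY;

    -- latest event per item: no "join first row ... order by" syntax,
    -- so rank rows in a subquery and keep rank 1
    SELECT i.id, i.name, e.ts, e.payload
    FROM items i
    JOIN (
        SELECT item_id, ts, payload,
               ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY ts DESC) AS rn
        FROM events
    ) e ON e.item_id = i.id AND e.rn = 1;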
Since you mention optimization, perhaps it would help to allow the abstraction to be peeled away and write clauses that express imperative commands. Like being able to write that what you want is an "index join table X on index Y", etc. That's sorta what hints do, but roll them into the language in a portable way. It could also allow the query planner to look at it and tell you "that's just not possible and you need to check your keys/indexes/queries/etc", rather than having to run it and guess what it's trying to do.
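Today that kind of physical instruction only exists as vendor-specific hints, e.g. Oracle's comment-based syntax (index and table names hypothetical):

    SELECT /*+ INDEX(o idx_orders_customer) */ *
    FROM orders o
    WHERE o.customer_id = 42;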
Because I kinda feel that's a lot of the tension for SQL queries (beyond the QOL stuff that everyone knows is stupid): it's a guessing game. You're trying to write queries that will be planned well, but the query planner is not a simple thing that you can easily predict. The closest analogy is probably something like OpenGL/DirectX, where you're just guessing at what behaviors might trigger some given heuristic in the driver, and there are a lot of unseen weights and flags. There are "equivalent" SQL expressions that are much heavier than other seemingly similar ones (like "select from ... join ... where exists (select ...)" vs "select where x in (select ...)"). There are operations that are mysteriously expensive because you're missing an index or the query isn't getting planned the way you expect.
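For example, these two queries ask for exactly the same rows (schema made up), yet depending on the planner and the data, one can be dramatically slower than the other:

    -- spelled as EXISTS
    SELECT * FROM orders o
    WHERE EXISTS (SELECT 1 FROM customers c
                  WHERE c.id = o.customer_id AND c.vip);

    -- spelled as IN
    SELECT * FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE vip);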
The suggestion of a "procedural" expression is, I think, also probably correct for some situations. PL/SQL functions are extremely useful, just obnoxious and in some cases arcane to write (as someone who never had formal DBA education). It would be even nicer if you could have an embedded Python-style thing that let you have "python" syntax with iteration and list comprehensions and shit, that represent DB queries using cursors and shit, and perhaps defer execution until some final "execute" command while transforming it all into a SQL query. Like C#'s LINQ but for Python: instead of buffering streams of objects, transform it into a database query. Transform operations/etc on fields into SQL statements that work on columns, etc.
Or java if you will. Imagine a Java Streams API that compiles down to planner ops. I know Hibernate Criteria and JPA exists, but skip that API and express it as streams: iterations and transforms and so on, and map that onto the DB. Being able to build subqueries, then attach them to larger queries, etc. That way they execute in bytecode rather than pl/java.
Postgres allows you to run Python in the database, Oracle has support for Java, Postgres can do LIMIT 100, and you can build subqueries using views and reuse these in larger queries...
Problem with query optimization is that it needs to be done at runtime, you can't optimize it in some procedure language easily. The optimal way to retrieve data depends on the number of records in your tables, the where clauses you use, the actual values you are filtering by, the records already in cache, etc.
99% of all programmers would not be able to program better performing queries when doing this with a procedural language or streams expressions or it would take them way too long.
I feel like most frustrations with SQL boil down to fighting against a shitty schema.
When you are sitting in a properly normalized database, it is a lot easier to write joins and views such that you can compose higher order queries on top.
If you are doing any sort of self-joins or other recursive/case madness, the SQL itself is typically not the problem. Whoever sat down with the business experts on day 1 in that conference room probably got the relational model wrong and they are ultimately to blame for your suffering.
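To the earlier point about composing views, a sketch of what that stacking can look like on a sane schema (names made up): each view captures one business concept, and higher-order queries compose cleanly on top.

    CREATE VIEW active_customers AS
        SELECT * FROM customers WHERE cancelled_at IS NULL;

    CREATE VIEW customer_revenue AS
        SELECT c.id, c.name, SUM(o.total) AS revenue
        FROM active_customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name;

    -- a higher-order query composed on top of both views
    SELECT * FROM customer_revenue WHERE revenue > 10000;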
If you have an opportunity to start over on a schema, don't try to do it in the database the first few times. Build it in excel and kick it around with the stakeholders for a few weeks. Once 100% of the participants are comfortable and understand why things are structured (related) the way they are, you can then proceed with the prototype implementation.
Achieving 3NF or better is usually a fundamental requirement for ensuring any meaningfully-complex schema doesn't go off the rails over time.
Only after you get it correct (facts/types/relations) should you even think about what performance issues might arise from what you just modeled. Premature optimization is how you end up screwing yourself really badly 99% of the time. Model it correctly, then consider an optimization pass if performance cannot be reconciled with basic indexing or application-level batching/caching.
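A minimal example of the kind of 3NF fix meant here (tables hypothetical): customer_name depends on customer_id rather than on the order itself, so it moves into its own relation.

    -- before: a transitive dependency smuggled into orders
    CREATE TABLE orders (id INT PRIMARY KEY, customer_id INT,
                         customer_name TEXT, total NUMERIC);

    -- after: the dependency lives in its own relation
    CREATE TABLE customers (id INT PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INT PRIMARY KEY,
                         customer_id INT REFERENCES customers(id),
                         total NUMERIC);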
The categories were made for man, not man for the categories.
A not-shitty schema is always to be preferred over a shitty one, to be sure. But for any data schema, there will be queries which cut against unexpected joints (joins?).
And SQL is bad for this. The entire Fine Article is a detailed exploration of how it's bad for this. A decent query language would keep the easy things easy (SQL is ok at this) and wouldn't make the hard things pointlessly difficult or impossible-except-at-the-application-layer.
Sometimes it's just a weird or one-off query, and sometimes there are so many of them that migrating to a better schema is indicated. Since it's difficult to do the dozen or so things the article sketches out, it's also difficult to iterate your way to a better schema arrangement.
The sort of waterfall database design you advocate (which, yes, think long and hard about your data!) always breaks down, because building business logic is a process of discovery. There are always missing pieces.
None of them in particular. The author of this article seems mostly concerned with constructing pedantic strawmen and handily defeating them with clever demonstrations. Claiming SQL is not expressive usually points to a fault on the part of the developer of the schema. The notion of a broader ER model and why a good one matters seems to have completely eluded the author, aside from a brief mention of 6NF (which is definitely not the right place to start the conversation on normalization).
It is really easy to take a negative stance on a technology and poke holes in it without having some grounding in reality. If you are dealing with a problem domain that has hundreds of types/facts/relations, and the need to author logic spanning all of these things all at once, you are going to have a hell of a time managing this with any other technology. Looking at SQL through the lens of toy problems is a massive mistake. You have to look at it through the lens of the worst problems imaginable to begin seeing the light.
I used to think SQL was some ancient trash until I saw the glory of a 40 table join in a factory automation setting. If you need dynamic, hard answers right now, SQL is the only option that makes sense at this kind of scale.
I felt it would contribute more meaningfully to the overall discussion on HN if I were to sidestep a direct critique and present a view of the problem from a different perspective.
Yeah, this has been my experience, too. SQL and RDBs are a lot less frustrating when someone takes the time to actually do some design and planning.
Personal experience incoming:
At startups, it's usually a mess because hiring someone who knows databases seems to always come so late in the game. At bigger corporations, well, hopefully the developers and database people get along and talk - otherwise, one of those teams is going to be a bottleneck.
> Model it correctly, then consider an optimization pass if performance cannot be reconciled with basic indexing or application-level batching/caching
So true. This also extends into general purpose languages, everything is so much easier when you take the time to model things correctly.
Schemas and requirements evolve over time. There is no perfect schema.
Schema design always results in endless "are we gonna need this?" questions, and usually everything is postponed down the road because it adds needless upfront complexity.
All data are naturally triplets, and with the relational model we are forced to group the triplet attributes into tables. The difficulty arises because these groupings are often incorrect, and then difficult to change. There is always premature optimization, because the SQL query statements become insane with too much normalization.
A bigger problem is how sticky schemas are due to how difficult they are to change, and how difficult it is to update your code to support a schema modification.
I think these two problems need more attention.
We've gone too far down the code-first / code-generation ORM approach which makes tinkering difficult.
I think all database design should happen visually. MS Access style. You should design your schema alongside your data (as you suggested by using spreadsheets). But instead of then transferring that to code, it should be a living schema that connects to your IDE code refactoring. The more declarative queries are and the closer they are to what needs to be finally rendered the more chance of reliable refactoring. The more imperative code modifying the queried data, the more complex and difficult to refactor things become. A lot of the cruft in a codebase is when schema changes are additive because people are too lazy or too risk averse to refactor.
E.g. Think about a User having one address, and then needing to support multiple addresses. There's quick and dirty way by adding more columns or adding to a JSON column, or we add a new Address entity and keep the existing User address columns, or we delete those columns and migrate data to the new Address table - which is the cleanest way to represent this schema change. I think there are few who would do the last option though, and this is the problem. Hiding this behind an API also causes more problems down the line because our frontend data model starts to separate from our actual DB data model.
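The clean last option is a mechanical migration, roughly like this (a sketch; assumes a users table that currently has street/city columns):

    -- 1. new entity
    CREATE TABLE address (
        id      INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        street  TEXT,
        city    TEXT
    );

    -- 2. move the existing data over
    INSERT INTO address (user_id, street, city)
    SELECT id, street, city FROM users;

    -- 3. drop the old columns - the step everyone skips, because
    --    every query that touched them must now be refactored
    ALTER TABLE users DROP COLUMN street;
    ALTER TABLE users DROP COLUMN city;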
Without any stakes in the game: IntelliJ IDEA does a pretty good job of integrating the application being written with the DB schemas it uses. It can autocomplete string SQL queries, and I think it can do database refactors - and though it doesn't automatically do it at the code site, it does add warnings when a string query becomes incorrect.
>The usual response to complaints about the lack of union types in sql is that you should use an id column that joins against multiple tables, one for each possible type.
>create table json_value(id integer);
>create table json_bool(id integer, value bool);
>create table json_number(id integer, value double);
No, the usual response is "Don't do that!"
99% of the time you either know the data types (so each JSON object becomes a row in a table where the column names are keys) or you don't know the data types and store the whole object as a BLOB
I'd be on board with adding unions to SQL, but I doubt I'd use that feature very often.
But the world just works like that. There are unions everywhere. No matter how it's implemented under the hood, it should really be a first class concept in any language, including SQL.
SQL already has first-class support for union types in the form of relations. Adding union-type columns would actually be second class compared to this. You would have to add special-case operators, and it wouldn't give you more power compared to just using relations.
That said, I'm not averse to the idea if someone can provide a realistic use case. The JSON example in the article is misguided though - you should not store the syntax of a serialization format, you should store the data model that the syntax represents.
I was talking specifically about the JSON example in the article. Needing to store objects in that manner is just a silly problem to have. Any solution will be slow or ugly or both.
You're right, unions are everywhere. Right now a human has to think about each union and how to represent it in a database. It would be really cool if I could store capnp objects like the one below and still get optimal query performance and aesthetics without thinking about it:
    struct Shape {
      area @0 :Float64;
      union {
        circle :group {
          radius @1 :Float64;
        }
        rectangle :group {
          width @2 :Float64;
          height @3 :Float64;
        }
      }
    }
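For comparison, the "unions as relations" encoding from upthread, applied to that struct, might look like this (sketch):

    CREATE TABLE shape (id INTEGER PRIMARY KEY,
                        area DOUBLE PRECISION);
    CREATE TABLE shape_circle (shape_id INTEGER PRIMARY KEY REFERENCES shape(id),
                               radius DOUBLE PRECISION);
    CREATE TABLE shape_rectangle (shape_id INTEGER PRIMARY KEY REFERENCES shape(id),
                                  width  DOUBLE PRECISION,
                                  height DOUBLE PRECISION);
    -- note: nothing here enforces that each shape is exactly one variant;
    -- that invariant is the part a human still has to think about
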
I do love SQL, and at least where I live (MS SQL Server) it can be made to run amazingly fast if you take some care with your queries and indexes. It's not portable though: as far as I know, not a single one of the big SQL vendors follows the standard 100%, and more importantly, spending some time with one vendor will give you habits that are sure not to work as well with another (cursor constructs are generally a death blow to performance in T-SQL, but they are the way to do it on Oracle, for example). So I kind of agree with the author here.
But I also feel that maybe they are asking a bit much from SQL. The complaint that complex subqueries are complex... well then, don't use them? I would use WITH constructs in that situation because I find them easier to read, but that's beside the point. I think it's perfectly fine to pull out multiple result sets from simple queries and then do the complex stuff in your host language.
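To illustrate the WITH style just mentioned (tables made up): each step gets a name instead of another level of nesting.

    WITH recent_orders AS (
        SELECT * FROM orders WHERE placed_at >= DATE '2021-01-01'
    ),
    big_orders AS (
        SELECT * FROM recent_orders WHERE total > 1000
    )
    SELECT customer_id, COUNT(*) AS n
    FROM big_orders
    GROUP BY customer_id;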
But this article is thought provoking to say the least. It follows the courtroom logic of holding the defendant SQL on trial for as much as possible. And SQL is guilty of a lot of crimes.
I do hope GraphQL and similar query languages become more prevalent and standardized, as it seems SQL could really use some stiffer competition.
SQL is often conflated with the relational model (hence the term NoSQL for non-relational databases), but the article is careful to explain that the relational model is great, but SQL is a clunky syntax/standard.
Going to graph databases is certainly throwing the baby out with the bathwater. The relational model was invented to address shortcomings in the hierarchical and graph database models.
I do not understand the comparison between SQL and GraphQL. As far as I can see they’re not close to the same. And GraphQL is also missing a bunch of really necessary things.
Are you also talking about switching from a relational model to something graph-based, or is there some way one could use GraphQL with a relational database?
GraphQL is stiff competition, in the sense of "stiff as a corpse". Yes, it has the words "query language" in the name, but that's where the similarities end.
In GraphQL, you can select fields of JSON objects, optionally a field may have parameters which affect the values it returns in unspecified ways. That's it. Because of this design, unlike in SQL where you are concerned with modeling the structure of your data, GraphQL also requires you to think about the API all users will use to access the data. In SQL, I can write a table defining the structure of my dataset, and then users of that table can perform arbitrary queries when they know what data they need (aggregate with a month average grouped by account ID, filter to only rows where x = 7, JOIN this to grab some data from some other table etc.).
GraphQL has no aggregates (sum, average...), no grouping, no joins, no sorting, no filtering, other than what you manually design, case by case, using parameters at schema design time. Good luck anticipating the use cases of every future consumer of your data. Miss one? Better dig back into your implementation code & implement that use case each & every time a new one comes up.
The only part of GraphQL that is standardized is the query syntax. In SQL, the actual underlying relational data model exists and the syntax of queries exists within that context; not so in GraphQL land. In SQL, I define my data structures, and users can write queries and access the data. But GraphQL throws up its hands and says "not my problem, try one of these libraries that ask you to implement your own custom data access functionality for all your data types".
OK, so it's a rubbish query language, but even the graph part of the name is misleading. Assuming that you even have a domain that makes sense to model with a graph of types, GraphQL provides you no tools for dealing with the backend complexity of such a design. Because the syntax is so simplified, there is no mechanism within it to define rules about relationships between types. For example, imagine a simple parent/child relationship. There is no mechanism within the syntax to tell GraphQL that for a parent X, X.child.parent = X. So you can't even think about writing a GraphQL query optimizer, because there isn't enough information about the structure of the data encoded into the schema or query to do so.
So in practice no GraphQL implementations that I know of have anything resembling a query optimizer - someone asks for cart.item, and then item.cart for a cart with 1000 items? Have fun re-requesting the cart from your DB 1000 times (yes you can cache the item by ID to save a DB lookup, but we shouldn't even need to hit cache here! Every programmer involved knows the data is the same, it's just dumb GraphQL has no clue about the details of the relationship).
My bet is that at some point we are going to get a very good NLP model where you feed in tabular data and then query through natural language. That would be a pretty big change and would eat a pretty big share of the SQL market.
For me the opposite is true. The value of multiple vendors with a similar product is that it drives prices down and liberates me from lock-in. I generally avoid managed services that are not available from all three major cloud providers, and put abstractions between my stuff and theirs so I can move my workloads around.
As to SQL: it's a weird feeling to read all that. I've spent 15 years working with very large relational databases - and about 5 years ago I ditched it all in favor of key-value object stores, wasting some storage but saving a metric ton of man-hours of development work. Not looking back.
I agree with your last point to an extent. There's something about "inexpressiveness" that can actually be good, in that it requires you to simplify your data model. However, I imagine that as the model gets more complex, SQL becomes completely unwieldy. You basically have to use a NoSQL db, as the author points out.
Ideally a RDBMS can also be a NoSQL DB. There is no reason it can't/shouldn't, if you have JSON-formatted columns.
Mostly the NoSQL pattern is a mistake because you almost certainly have columns that recur reasonably frequently, but if you do have high-dimensional data or want to store processed documents/etc, you can represent them as JSON/JSONB in Postgres/etc, and even run indexes or queries on them.
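In Postgres that hybrid looks roughly like this (sketch): recurring fields stay as columns, the rest goes into a JSONB column that can still be indexed and queried.

    CREATE TABLE events (
        id      BIGSERIAL PRIMARY KEY,
        user_id BIGINT NOT NULL,   -- recurring fields stay relational
        payload JSONB              -- high-dimensional leftovers go here
    );

    -- a GIN index makes containment queries on the document indexable
    CREATE INDEX events_payload_idx ON events USING GIN (payload);

    SELECT * FROM events WHERE payload @> '{"source": "mobile"}';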
Yeah, however it is kind of strange to single out SQL when we have examples like a very famous kernel that can only be built with a specific compiler, or Web APIs that are only implemented by a specific browser.
I don't see it as strange. They probably spent lots of time mucking around in SQL trying to force it to do things it's not good at, so now they have strong opinions on SQL. They specifically state that it's too late to turn back the clock on SQL, but it's not too late to expose RDBMS access with a better language, or with a sufficiently flexible API to allow arbitrary QL implementations. They aren't shitting on anyone here either. They say that the assumptions made when SQL was designed ultimately didn't pan out:
> The original idea of relational databases was that they would be queried directly from the client. With the rise of the web this idea died - SQL is too complex to be easily secured against adversarial input, cache invalidation for SQL queries is too hard, and there is no way to easily spawn background tasks (eg resizing images) or to communicate with the rest of the world (eg sending email). And the SQL language itself was not an appealing target for adding these capabilities.
> So instead we added the 'application layer' - a process written in a reasonable programming language that would live between the database and the client and manage their communication. And we invented ORM to patch over the weaknesses of SQL, especially the lack of compressibility.
SQL was not made for programmers alone. It was also designed for less technical people, so the verbosity and overhead are part of the deal.
>When Ray and I were designing Sequel in 1974, we thought that the predominant use of the language would be for ad-hoc queries by planners and other professionals whose domain of expertise was not primarily database management. We wanted the language to be simple enough that ordinary people could "walk up and use it" with a minimum of training.
> We wanted the language to be simple enough that ordinary people could "walk up and use it" with a minimum of training.
In that case it has been an abject failure. I have been using SQL since the mid 1980s (so pretty much since the start of its widespread adoption) and I have never met "ordinary people" (by which I assume intelligent business-oriented professionals) who could (or wanted to) cope with it.
I like it, but the idea of sets does not come naturally to most people (me included). But I once worked with a programmer who had never been exposed to SQL - I lent him an introductory book on it and he came in the next day and said "Oh, of course, it's all sets, isn't it?" and then went on to write some of the most fiendish queries I have ever seen.
I don't think it's disputed that the original goal of SQL was a flop. The designers grossly overestimated the technical chops of a layman. However, I would argue that us tech people did benefit from that original goal of simplicity. I mean, SELECT first, last FROM employees WHERE id = 10 is not too bad at all. Kind of elegant, no?
If SQL was designed "by engineers for engineers", you would be using esoteric Git commands just to blow off steam.
This is very important. I personally like old technologies and languages where the designers considered users who had limited technical skills, and most importantly, assumed that those users had no interest or need to improve their technical skills. Removing the assumption that users are willing to increase their technical sophistication forces a designer to think more about what they're designing. Looking at older languages is interesting - for all their warts, they do feel more intentional in their design than modern things that have a clear developer-centric mindset baked in.
It's similar to the discussions around COBOL back in the day. What's interesting is that the primary "end user" language turned out to be Excel formulas. Who would have thought a declarative relational language with matrices would "win" that battle.
Any argument that users will "write their own" language is basically flawed. Users want results; if there's no alternative, they'll do it themselves in the simplest, but probably most inefficient, way possible.
I share the author's point of view, which led me to start a new relational programming language that compiles to SQL. It's a way to build on existing databases, like postgres or mysql, with all of their advantages, but improve on many of SQL's limitations.
I was sketching a syntax for Python to work with pandas, to avoid extra annoying code, and surprisingly some of the elements and approaches I see in your code are similar. My syntax for subsetting was:
I noticed that in working with pandas you often need to do lookups into other dataframes. It's partially solved by the assignment operator if the left field equals the right index, or by the .map method, the same way.
But often you need a lookup with merging by an arbitrary column, then grouping and aggregation by the left table's items. This is partially doable without special functions. But this could be a killer feature if one made it work for spatial joins.
Very often you need to do the following:
gdf1 = geodataframe of points
gdf2 = geodataframe of points
need to make gdf1.geometry.buffer(500 m), sjoin it with gdf2.geometry, then look up gdf2 fields and bring them to gdf1, keeping the original gdf1.geometry (points). This operation takes a dozen lines and leaves lots of variable garbage if not put into a separate function.
But IMO it could be condensed to something more natively supported.
What I really wanted was something compatible with SQL, but that augmented it (and compiled to plain SQL). Like TypeScript is for JavaScript, but instead of adding a type system, adding constructs for better composability.
Looks interesting. I've been thinking about trying this myself, and one of my goals has been to create a language that's easily introspectable. I think that's much more important for a query language than for an application language, since you'll want to see what code in the former does without running it, for integration into application code.
My approach has been to design a very simple (in the lisp sense) syntax, kind of the opposite to SQL where everything is hard-coded into the parser. I've adopted a "pipeline programming"-like approach, where the operations are just (special) functions, which also helps with extensibility. Have you thought about this? From a cursory look, it seems Preql does rely on keywords. Admittedly fewer than SQL, but it also doesn't cover all of its features.
I don't have a lot of experience writing Lisp-y code, so perhaps I'm speaking from ignorance, but I think there is a reason that syntax never gained huge traction. Imho a syntax that's concise and expressive is important for effective coding. Having operators for the most common operations is just a small complication that yields a big reward.
Having said that, the amount of keywords and operators that you see in Preql right now, isn't likely to grow by much. I have the basic and common operators covered, and for the most part, the rest can be done with regular functions.
I agree about introspection, which is why in Preql you can ask for the type of any value, list the columns of a table, get the resulting SQL of an expression as a string, and so on. And certainly more can and should be done towards that.
This is a great article and you can tell the author has deep experience with SQL from the way they speak and the other projects they're involved in.
I think many of the comments here are missing the point by saying "Oh, you can get around that issue in that example snippet by doing X Y Z". Sure, there are workarounds for everything if you know the One Weird Trick with these 10 gotchas that I won't tell you about... but that just makes the author's point.
We can do better. We deserve better.
What could things look like if you could radically alter the SQL language, replace it altogether, or even move layers from databases or applications into each other?
Who knows if it will be better or worse, but I'd like to find out.
>> The core message [...] is that there is potentially a huge amount of value to be unlocked by replacing SQL
To me, a lot of people defends SQL saying that "perfect is the enemy of good" and that SQL simply works. Not the favourite of anyone, but everyone kinda accepts it.
And yeah, it's true. People use SQL because it's good enough, and trying to reinvent the wheel would take more work (individually speaking) than just dealing with SQL as it is right now. For large organizations where the effort could be justified, all your engineers already know SQL anyway, so it's not so great either.
But for something so relevant as relational databases, perfect is not the enemy of good. We do deserve better. We generally agree that SQL has many pitfalls, it's not great for any kind of user (for non-technical users, a visual programming language would work well here, more like what Airtable does, closing the bridge between spreadsheet and hardcore database, and for technical users, it does feel unwieldy and quirky). We should be more open to at least consider critiques and proposals for better. We might find out that people, from time to time, are making some good points.
I agree with Jamie. I think SQL is irritatingly non-composable, many operations require gymnastics to express, and I'd like a more expressive and regular language to write queries in.
I also suspect that such a language would be harder to optimize, and might be more practical to implement if it was less declarative and closer to the plan.
I also think that as you scale up, offloading computation to the database is a false economy. It's closer to the data, but the database itself is a bottleneck. As you scale, you want to limit the complexity of your queries. But this is actually an argument for things like Materialize, i.e. making copies of the data elsewhere that bake in some of the computation ahead of time.
One approach to radically simplify operations with data is to use mathematical functions (in addition to mathematical sets) which is implemented in Prosto data processing toolkit [0] and (new) Column-SQL [1].
[0] https://github.com/asavinov/prosto Prosto is a data processing toolkit - an alternative to map-reduce and join-groupby
[1] https://prosto.readthedocs.io/en/latest/text/column-sql.html Column-SQL (work in progress)
On paper, a more expressive language that spits out SQL queries sounds great, but I've never seen a single one not become a pain in the ass to use.
https://logica.dev/
Speaking of which, what has happened to hadopp and HDFS? Used to be flavor of the month about 10 years ago, but now I hardly ever hear people talk about it.
Trying to change to the Dvorak layout taught me a lot about enacting change on such a grand scale.
After a lot of hassle switching machines and OSes, typing on other user‘s computers, them typing on mine and general headaches among internationalization and inaccessible key combinations, I switched back to QWERTY in the end.
As said in other threads, there is no shortage of attempts to replace SQL. A lot of them are pretty good. But having learned SQL the hard way, I feel zero urge to learn another language from scratch right now.
It‘s why nearly all big databases eventually switched to it despite the FORTRAN vibe and its general ugliness.
Anyway, probably time to SET TRANSACTION LEVEL READ UNCOMMITTED and call it a day ^^
For example, in iOS development (and, more in general, on Apple platforms), there has been a huge shift from Objective-C to Swift.
The same arguments should apply there. Swift is much better, but Objective-C got the work done, and many codebases were written in it, especially at Apple. And yet, the whole community switched pretty quickly.
One could argue that Swift was easier to pick up for newcomers. While that's true, I would then expect the argument to apply also to SQL alternatives.
So, what is the difference here?
On the other hand, the fact that many new time series databases (and other engines, including Materialize which the author of the article worked on) wanted sql (or fake sql-like languages when they couldn’t manage sql) suggests that not being sql can be a big hindrance.
Off-topic, but changing layouts is easy. Also, using other people's keyboards is not the best for hygiene even before covid.
Deleted Comment
Stop fighting SQL so much, and just focus on bringing a better solution. If potential users see it has significant benefits they'll start using it.
Many of these blog articles seem to be written by devs that have only really experienced the user-facing application side of things, and really don’t realise the sheer number of financial, analytic & business systems, that keep the world spinning round, that are all happily buzzing away using SQL.
Moaning about JSON manipulation in SQL is madness. A client should use appropriate joins to get the data from the database & then do the transformation. Storing JSON when you actually need to perform operations on said data makes 0 sense. It’s a huge overhead, an utter nightmare to index etc.
I think that's why NoSQL is where the gains are being made. You can't magically improve things with a query language without putting the effort into performance, but you can get huge gains by changing your assumptions about how data is structured.
There were multiple optimizing compilers which can do a lot of these asks, allowing composable queries that return nested types, but produce SQL queries. I think the pathfinder compiler had the most real-world use, it was meant to efficiently mix SQL and xquery to query SQL+XML documents stored in postgres. It had a c# linq frontend, but also fairly hefty compile times.
But in my world, SQL is much more of a Human Data Interface than Application Programming Interface. SQL is used for both ad-hoc and automated data querying and transformation. It is something manufacturing intelligence and supply chain experts and researchers and others learn to empower themselves to get to the data - and yes, you won't see any CEO running reports in SQL themselves, but it is not for programmers only.
Those people would not benefit from syntax with more brackets instead of words, easier way to define and call functions, or the ability to install libraries - in fact I think it would make it harder and less accessible for them.
The OP is right that for machine-to-machine interface, all the syntax baroqueness is not worth it. And of course, having more portability would be great.
But while machine-to-machine interfaces written by skilled developers who knows exactly what are they doing might actually be the most common use of SQL (since half the internet runs on Wordpress and MySQL and most smartphone apps might have SQLite in them), it is not where the majority of complexity or time is spent - that one is with the people working with data.
In the event that nothing exists that is right for the job: don't worry about replacing SQL, worry about creating/using a matched DSL/API/(whatever is appropriate) and data store for the particular problem you're trying to solve that SQL isn't a good fit for. If you create something that solves its particular problem even half as well as SQL did for relational databases you'll probably have a winner in that space that people with the same problems would be happier using.
Scalable databases are just so difficult that we’re still driving a ‘64 IMPALA
Most of this opinion comes from “SQL” being vendor-specific. Is JSON vendor-specific? Is anything else, that we actually use by choice?
Mad at you too, Graph DBs, for sending us on another snipe hunt by adding vendor-imposed innovations, because it makes the enterprise marginally profitable. It’s how the world works, I’m still not happy about it.
(Disclaimer: I possibly just traveled back to 2002 and said this on slashdot)
Two things that come to mind are Markdown with all its flavors and Regex with multiple engines.
edit: and to a lesser extent maybe C/C++ compilers and JS engines.
edit2: also JVM, Python and Ruby runtimes
But both edits describe technologies with an official spec and slightly different implementations. Markdown/Regex are more comparable to SQL because they have vendor-specific syntax.
What type of database engine is used to execute that query is independent of the language. It could be highly scalable or it could be focussed on single user, single process.
What is the problem that needs solving?
Totally! SQL is like shell, c and js that somehow generate this "never ever try to improve them for real, lets stay suffering for all eternity!"
And the AMOUNT of energy in mask them (transpilers, linters, "good practiques", tricks, etc) is a monumental waste of effort that cost so much than actually go and fix the core issues.
And in this case, SQL is the most trivial to fix. It could get simplified and streamlined far easier that is to replace C, for example.
And a lot of interactions to the DBs are using bridges already. Create a "good sql" is similar to create WASM and then for legacy reasons support in EOL fashion all the old sql and all the new go against the new way.
But this mean at least 2 or 3 major database vendors go for it (I dream: sqlite & postgresql).
P.D: I'm exploring building a relational language, so already have a simplified dialect for it: https://tablam.org/tutorial
In this you might be able to guess my objection to many of OPs points. Because the key thing we look for when we decide to implement tech isn’t how good it is, but how easy it is to work with and more importantly, how easy it is to find other people who can pick up when someone’s career take them forward in life.
SQL is the one technology that has been stable enough over the years that no matter who we hire, we can always count on them being capable of working with our Databases. Yes, some of the more expansive functions, especially some of the recursive ones, take a little time to get accustomed to and there will always be some back-and-forth preferences in regards to stores procedures, but on a whole, it’s the one technology that never causes us any pain and never requires retraining.
I'm sure it's not efficient, and OP is completely correct about the untapped potential, and I fully support people developing rational alternatives to SQL. I'm simply pointing out why SQL has been the workhorse it has been since, well, basically since we moved beyond punch cards, and why it'll likely still be so in 50 years.
Because at the end of the day, technology doesn't really matter to the business. Cost vs. benefit does, and as long as it's easier and therefore cheaper to have people work with SQL (in the big picture), it'll be what businesses work with.
Why you don't see something similar in declarative (or even functional) programming is really beyond me, but it probably has to do with how things evolved and with how much easier it is to change that part of the tech stack compared to your databases. If we get a new hire who can do magic with JavaScript/Python/whatever, we may consider allowing them to do that. We don't want to, again because the big picture is easier to manage with fewer technologies, but it's much easier to justify the cost-benefit of that than of someone wanting to use a different database than the one in our managed cluster. Already you need more staff than that one person to maintain it, since developers don't run database server operations, aren't part of the contingency staff, and so on.
Like I said, I fully encourage the direction OP wants to go. Enterprise rarely invent things, we’d be happy with ball frames if that’s what made us the most money, so we need people like OP to invent things for us to adopt.
I also happen to really dislike that the author hasn't capitalized syntax like SELECT FROM WHERE or CREATE TABLE, which, to me, hurts legibility and therefore makes me less interested in reading the argument overall.
Unfortunately, building a new database is a huge project and there appears to be no party currently willing to sponsor it.
I would also guess that we could have a better SQL, but I do not think it could or should look anything like a general-purpose programming language, because otherwise you might get in the way of efficient query execution. Maybe some kind of declarative data-flow description with sufficient facilities to name and reuse bits and pieces.
And maybe you actually want two languages which SQL with its procedural parts already kind of has. One limited but well optimizable language to actually access the data, one more general language to perform additional processing without having to leave the server. Maybe the real problem of SQL lies mostly in its procedural part and how it interfaces and interacts with the query part.
Probably in exactly the opposite way: the limitations of SQL put a lot of work on the back of the query optimiser without allowing said optimiser to easily reason about the queries, or the writer to easily guide the optimiser (short of dedicated extensions, e.g. Oracle's optimizer hints).
Or so I would think; I've never heard of SQL being praised for its optimisability.
An example here is how, sure, in theory, JITs can outpace AOT compilation because they have all the information the AOT compiler has plus runtime insights. But the ability to truly do that always seems to be a decade of compiler development away, with many giving up on the idea entirely.
It's also important to consider what we're comparing SQL's optimizability against. If it's against typical NoSQL databases, most of which seem to favour a lower-level query specification, I can defend SQL's optimizability to the end - with SQL databases having the freedom to dynamically switch algorithms to adapt to the actual contents of the database at the time of the query. Something which, ironically, a stale optimizer hint (i.e. written with the size & shape of the database as it was last year in mind) can actually get in the way of. Not that I'm saying that SQL planners never produce stupid query plans of course.
Maybe? But I kind of feel that the query optimizer often gets in the way of efficient query execution.
There’s so much stuff that’s semantically the same, but if you shuffle some strings around, the query is suddenly 10 times faster.
There are quite a lot of pretty basic things many programmers would want to do that are just way more stupid than they should be. One that irks me is things like "return me the top 100 items", or "return/join against the most recent event for item X in a history log", which end up requiring way more shit than they should because there's no standards-compliant way to say "select first X rows from ... where ... order by ..." or "join first row where ... order by ...". In the case of the top 100 you just wrap it in another query, but for more analytical stuff like "join the most recent X for this item" you have to use window functions, and the syntax fucking sucks.
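For what it's worth, the top-N case does have a standard form since SQL:2008 (FETCH FIRST), though support varies; the "most recent per item" case is where the window-function gymnastics come in. A sketch, with hypothetical items/events tables:

    -- Top 100 (SQL:2008 syntax; support varies by engine):
    SELECT * FROM items ORDER BY score DESC FETCH FIRST 100 ROWS ONLY;

    -- "Join the most recent event per item" via a window function:
    SELECT i.*, e.*
    FROM items i
    JOIN (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY item_id
                                  ORDER BY created_at DESC) AS rn
        FROM events
    ) e ON e.item_id = i.id AND e.rn = 1;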
Since you mention optimization, perhaps it would help to allow the abstraction to be peeled away and write clauses that express imperative commands. Like being able to write that what you want is an "index join table X on index Y", etc. That's sorta what hints do, but roll them into the language in a portable way. It could also allow the query planner to look at it and tell you "that's just not possible and you need to check your keys/indexes/queries/etc", rather than having to run it and guess what it's trying to do.
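To illustrate: Oracle's real hint syntax hides the directive in a comment the planner is free to ignore, whereas a first-class portable version might look like the second form below (that second syntax is purely hypothetical):

    -- Oracle today: hint lives in a comment, silently ignored if invalid
    SELECT /*+ INDEX(o orders_customer_idx) */ *
    FROM orders o
    WHERE o.customer_id = 7;

    -- Hypothetical first-class form the planner must honor or reject loudly
    SELECT * FROM orders USING INDEX orders_customer_idx
    WHERE customer_id = 7;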
Because I kinda feel that's a lot of the tension for SQL queries (beyond the QOL stuff that everyone knows is stupid). It's a guessing game. You're trying to write queries that will be planned well, but the query planner is not a simple thing that you can easily predict. The closest analogy is probably something like OpenGL/DirectX, where you're just guessing at what behaviors might trigger some given heuristic in the driver, and there are a lot of unseen weights and flags. There are "equivalent" SQL expressions that are much heavier than other seemingly similar ones (like "select from ... join ... where exists (select ...)" vs "select ... where x in (select ...)"). There are operations that are mysteriously expensive because you're missing an index or the query isn't getting planned the way you expect.
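To make the guessing game concrete, here are two queries that ask the same question but can get very different plans on some engines (table names hypothetical):

    -- Form 1: correlated EXISTS
    SELECT o.id
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM refunds r WHERE r.order_id = o.id);

    -- Form 2: IN over a subquery
    SELECT o.id
    FROM orders o
    WHERE o.id IN (SELECT r.order_id FROM refunds r);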
The suggestion of a "procedural" expression is, I think, also probably correct for some situations. PL/SQL functions are extremely useful, just obnoxious and in some cases arcane to write (speaking as someone with no formal DBA education). It would be even nicer if you could have an embedded Python-style thing that gives you Python syntax with iteration and list comprehensions, representing DB queries using cursors, and perhaps defers execution until some final "execute" command while transforming it all into a SQL query. Like C#'s LINQ but for Python: instead of buffering streams of objects, transform it into a database query; turn operations on fields into SQL statements that work on columns, etc.
Or Java, if you will. Imagine a Java Streams API that compiles down to planner ops. I know Hibernate Criteria and JPA exist, but skip those APIs and express it as streams: iterations and transforms and so on, mapped onto the DB. Being able to build subqueries, then attach them to larger queries, etc. That way they execute in bytecode rather than PL/Java.
The problem with query optimization is that it needs to be done at runtime; you can't easily do it ahead of time in some procedural language. The optimal way to retrieve data depends on the number of records in your tables, the where clauses you use, the actual values you are filtering by, the records already in cache, etc.
99% of programmers would not be able to write better-performing queries by doing this in a procedural language or with stream expressions, or it would take them way too long.
Uses Python expressions and generates SQL.
Also does static typing, so you can run a type checker on the code.
When you are sitting in a properly normalized database, it is a lot easier to write joins and views such that you can compose higher order queries on top.
If you are doing any sort of self-joins or other recursive/case madness, the SQL itself is typically not the problem. Whoever sat down with the business experts on day 1 in that conference room probably got the relational model wrong and they are ultimately to blame for your suffering.
If you have an opportunity to start over on a schema, don't try to do it in the database the first few times. Build it in Excel and kick it around with the stakeholders for a few weeks. Once 100% of the participants are comfortable and understand why things are structured (related) the way they are, you can proceed with the prototype implementation.
Achieving 3NF or better is usually a fundamental requirement for ensuring any meaningfully-complex schema doesn't go off the rails over time.
Only after you get it correct (facts/types/relations) should you even think about what performance issues might arise from what you just modeled. Premature optimization is how you end up screwing yourself really badly 99% of the time. Model it correctly, then consider an optimization pass if performance cannot be reconciled with basic indexing or application-level batching/caching.
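A minimal sketch of both points above, the 3NF requirement and composable views (all names hypothetical): factor out a transitive dependency, then expose a view that higher-order queries can build on:

    -- customers(id, name, zip, city) has the transitive dependency
    -- zip -> city, violating 3NF; factor it out:
    CREATE TABLE zip_codes (
        zip  TEXT PRIMARY KEY,
        city TEXT NOT NULL
    );
    CREATE TABLE customers (
        id   INT PRIMARY KEY,
        name TEXT NOT NULL,
        zip  TEXT REFERENCES zip_codes(zip)
    );

    -- A view downstream queries can compose against:
    CREATE VIEW customers_with_city AS
    SELECT c.id, c.name, z.city
    FROM customers c JOIN zip_codes z ON z.zip = c.zip;

    SELECT city, count(*) FROM customers_with_city GROUP BY city;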
A not-shitty schema is always to be preferred over a shitty one, to be sure. But for any data schema, there will be queries which cut against unexpected joints (joins?).
And SQL is bad for this. The entire Fine Article is a detailed exploration of how it's bad for this. A decent query language would keep the easy things easy (SQL is ok at this) and wouldn't make the hard things pointlessly difficult or impossible-except-at-the-application-layer.
Sometimes it's just a weird or one-off query, and sometimes there are so many of them that migrating to a better schema is indicated. Since it's difficult to do the dozen or so things the article sketches out, it's also difficult to iterate your way to a better schema arrangement.
The sort of waterfall database design you advocate (which, yes, think long and hard about your data!) always breaks down, because building business logic is a process of discovery. There are always missing pieces.
Which one of the frustrations from the article boils down to fighting against shitty schema?
It is really easy to take a negative stance on a technology and poke holes in it without having some grounding in reality. If you are dealing with a problem domain that has hundreds of types/facts/relations, and the need to author logic spanning all of these things all at once, you are going to have a hell of a time managing this with any other technology. Looking at SQL through the lens of toy problems is a massive mistake. You have to look at it through the lens of the worst problems imaginable to begin seeing the light.
I used to think SQL was some ancient trash until I saw the glory of a 40 table join in a factory automation setting. If you need dynamic, hard answers right now, SQL is the only option that makes sense at this kind of scale.
I felt it would contribute more meaningfully to the overall discussion on HN if I were to sidestep a direct critique and present a view of the problem from a different perspective.
Personal experience incoming: At startups, it's usually a mess because hiring someone who knows databases seems to always come so late in the game. At bigger corporations, well, hopefully the developers and database people get along and talk - otherwise, one of those teams is going to be a bottleneck.
> Model it correctly, then consider an optimization pass if performance cannot be reconciled with basic indexing or application-level batching/caching
So true. This also extends into general purpose languages, everything is so much easier when you take the time to model things correctly.
Schema design always results in endless "are we gonna need this?" questions, and usually everything is postponed down the road because it adds needless upfront complexity.
All data are naturally triplets, and with the relational model we are forced to group the triplet attributes into tables. The difficulty arises because these groupings are often incorrect, and then difficult to change. And there is always premature optimization, because with too much normalization the SQL queries become insane.
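In SQL terms the "everything is triplets" view is the entity-attribute-value pattern; a sketch (hypothetical table), with the usual caveat that EAV trades query sanity for flexibility:

    CREATE TABLE triples (
        entity    BIGINT NOT NULL,  -- what the fact is about
        attribute TEXT   NOT NULL,  -- which property
        value     TEXT   NOT NULL   -- its value, stringly typed
    );

    -- Reassembling a "row" costs one self-join per attribute:
    SELECT n.entity, n.value AS name, e.value AS email
    FROM triples n
    JOIN triples e ON e.entity = n.entity AND e.attribute = 'email'
    WHERE n.attribute = 'name';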
A bigger problem is how sticky schemas are due to how difficult they are to change, and how difficult it is to update your code to support a schema modification.
I think these two problems need more attention.
We've gone too far down the code-first / code-generation ORM approach which makes tinkering difficult.
I think all database design should happen visually. MS Access style. You should design your schema alongside your data (as you suggested by using spreadsheets). But instead of then transferring that to code, it should be a living schema that connects to your IDE code refactoring. The more declarative queries are and the closer they are to what needs to be finally rendered the more chance of reliable refactoring. The more imperative code modifying the queried data, the more complex and difficult to refactor things become. A lot of the cruft in a codebase is when schema changes are additive because people are too lazy or too risk averse to refactor.
E.g. think about a User having one address, and then needing to support multiple addresses. There's a quick and dirty way, by adding more columns or adding to a JSON column; or we add a new Address entity and keep the existing User address columns; or we delete those columns and migrate the data to the new Address table, which is the cleanest way to represent this schema change (sketched below). I think there are few who would take the last option, though, and this is the problem. Hiding this behind an API also causes more problems down the line, because our frontend data model starts to drift away from our actual DB data model.
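A Postgres-flavored sketch of that cleanest option (table and column names hypothetical):

    -- 1. New entity
    CREATE TABLE addresses (
        id      INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        user_id INT NOT NULL REFERENCES users(id),
        street  TEXT,
        city    TEXT
    );

    -- 2. Migrate the existing one-address-per-user data
    INSERT INTO addresses (user_id, street, city)
    SELECT id, street, city FROM users WHERE street IS NOT NULL;

    -- 3. Drop the old columns once all readers are migrated
    ALTER TABLE users DROP COLUMN street, DROP COLUMN city;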
We need better tools, and less code.
>create table json_value(id integer);
>create table json_bool(id integer, value bool)
>create table json_number(id integer, value double);
No, the usual response is "Don't do that!"
99% of the time you either know the data types (so each JSON object becomes a row in a table where the column names are keys) or you don't know the data types and store the whole object as a BLOB
I'd be on board with adding unions to SQL, but I doubt I'd use that feature very often.
That said, I'm not averse to the idea if someone can provide a realistic use case. The JSON example in the article is misguided, though: you should not save the structure of the syntax of a serialization format, you should save the data model that the syntax represents.
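A sketch of the two usual options (hypothetical names): a known shape becomes typed columns, an unknown shape becomes an opaque document:

    -- Known shape: one row per JSON object, keys become typed columns
    CREATE TABLE events (
        id     BIGINT PRIMARY KEY,
        kind   TEXT NOT NULL,
        amount DOUBLE PRECISION
    );

    -- Unknown shape: store the whole document (JSONB in Postgres,
    -- BLOB/TEXT elsewhere)
    CREATE TABLE raw_events (
        id   BIGINT PRIMARY KEY,
        body JSONB NOT NULL
    );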
You're right, unions are everywhere. Right now a human has to think about each union and how to represent it in a database. It would be really cool if I could store capnp objects like the one below and still get optimal query performance and aesthetics without thinking about it:
struct Shape {
  area @0 :Float64;
  union {
    circle :group {
      radius @1 :Float64;
    }
    square :group {
      width @2 :Float64;
    }
  }
}
But I also feel that maybe they are asking a bit much from SQL. The complaint that complex subqueries are complex... well, then don't use them? I would use WITH constructs in that situation because I find them easier to read, but that's beside the point. I think it's perfectly fine to pull out multiple result sets from simple queries and then do the complex stuff in your host language.
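E.g. a WITH construct that keeps the complex part readable before handing the rest to the host language (hypothetical tables; FETCH FIRST inside a CTE works in Postgres, support varies elsewhere):

    WITH top_customers AS (
        SELECT customer_id, sum(total) AS spend
        FROM orders
        GROUP BY customer_id
        ORDER BY spend DESC
        FETCH FIRST 10 ROWS ONLY
    )
    SELECT c.name, t.spend
    FROM top_customers t
    JOIN customers c ON c.id = t.customer_id;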
But this article is thought provoking to say the least. It follows the courtroom logic of holding the defendant SQL on trial for as much as possible. And SQL is guilty of a lot of crimes.
I do hope GraphQL and similar query languages become more prevalent and standardized, as it seems SQL could really use some stiffer competition.
Going to graph databases is certainly throwing the baby out with the bathwater. The relational model was invented to address shortcomings in the hierarchical and graph database models.
In GraphQL, you can select fields of JSON objects; optionally a field may have parameters which affect the values it returns in unspecified ways. That's it. Because of this design, GraphQL requires you to think not only about modeling the structure of your data (as in SQL) but also about the API every user will use to access it. In SQL, I can write a table defining the structure of my dataset, and then users of that table can perform arbitrary queries when they know what data they need (aggregate with a month average grouped by account ID, filter to only rows where x = 7, JOIN this to grab some data from some other table, etc.).
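For example, the kind of ad-hoc query SQL gives every consumer for free, without anyone having anticipated it at schema design time (Postgres-flavored, hypothetical schema):

    SELECT account_id,
           date_trunc('month', created_at) AS month,
           avg(amount) AS monthly_avg
    FROM transactions
    WHERE x = 7
    GROUP BY account_id, date_trunc('month', created_at);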
GraphQL has no aggregates (sum, average...), no grouping, no joins, no sorting, no filtering, other than what you manually design each using parameters at schema design time. Good luck anticipating the use cases of every future consumer of your data. Miss one? Better dig back into your implementation code & implement that use case each & every time a new one comes up.
The only part of GraphQL that is standardized is the query syntax. In SQL, the actual underlying relational data model exists and the syntax of queries exists within that context; not so in GraphQL land. In SQL, I define my data structures, and users can write queries and access the data. But GraphQL throws up its hands and says "not my problem, try one of these libraries that ask you to implement your own custom data access functionality for all your data types".
OK, so it's a rubbish query language, but even the graph part of the name is misleading. Assuming you even have a domain that makes sense to model as a graph of types, GraphQL provides no tools for dealing with the backend complexity of such a design. Because the syntax is so simplified, there is no mechanism within it to define rules about relationships between types. For example, imagine a simple parent/child relationship: there is no way to tell GraphQL that for a parent X, X.child.parent = X. So you can't even think about writing a GraphQL query optimizer, because there isn't enough information about the structure of the data encoded in the schema or query to do so.
So in practice no GraphQL implementations that I know of have anything resembling a query optimizer - someone asks for cart.item, and then item.cart for a cart with 1000 items? Have fun re-requesting the cart from your DB 1000 times (yes you can cache the item by ID to save a DB lookup, but we shouldn't even need to hit cache here! Every programmer involved knows the data is the same, it's just dumb GraphQL has no clue about the details of the relationship).
It seems like having multiple vendors is only valuable if their products are to some degree differentiated, no?
As to SQL: it's a weird feeling to read all that. I've spent 15 years working with very large relational databases, and about 5 years ago I ditched it all in favor of key-value object stores, wasting some storage but saving a metric ton of man-hours of development work. Not looking back.
Mostly the NoSQL pattern is a mistake, because you almost certainly have columns that recur reasonably frequently; but if you do have high-dimensional data or want to store processed documents etc., you can represent them as JSON/JSONB in Postgres etc., and even run indexes or queries on them.
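E.g. in Postgres (real syntax), a document column can still be indexed and queried:

    CREATE TABLE docs (
        id   BIGINT PRIMARY KEY,
        data JSONB NOT NULL
    );

    -- GIN index accelerates containment queries on the document
    CREATE INDEX docs_data_idx ON docs USING GIN (data);

    -- Find docs whose JSON contains {"status": "active"}
    SELECT id FROM docs WHERE data @> '{"status": "active"}';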
> The original idea of relational databases was that they would be queried directly from the client. With the rise of the web this idea died - SQL is too complex to be easily secured against adversarial input, cache invalidation for SQL queries is too hard, and there is no way to easily spawn background tasks (eg resizing images) or to communicate with the rest of the world (eg sending email). And the SQL language itself was not an appealing target for adding these capabilities.
> So instead we added the 'application layer' - a process written in a reasonable programming language that would live between the database and the client and manage their communication. And we invented ORM to patch over the weaknesses of SQL, especially the lack of compressibility.
>When Ray and I were designing Sequel in 1974, we thought that the predominant use of the language would be for ad-hoc queries by planners and other professionals whose domain of expertise was not primarily database management. We wanted the language to be simple enough that ordinary people could "walk up and use it" with a minimum of training.
https://ieeexplore.ieee.org/document/6359709
In that case it has been an abject failure. I have been using SQL since the mid 1980s (so pretty much since the start of its widespread adoption) and I have never met "ordinary people" (by which I assume they meant intelligent business-oriented professionals) who could, or wanted to, cope with it.
I like it, but the idea of sets does not come naturally to most people (me included). But I once worked with a programmer who had never been exposed to SQL - I lent him an introductory book on it and he came in the next day and said "Oh, of course, it's all sets, isn't it?" and then went on to write some of the most fiendish queries I have ever seen.
If SQL was designed "by engineers for engineers", you would be using esoteric Git commands just to blow off steam.
Any argument that "users will write their own queries" is basically flawed. Users want results; if there's no alternative, they'll do it themselves, in the simplest but probably most inefficient way possible.
If that sounds interesting, you can find it here: https://github.com/erezsh/Preql
But often you need a lookup that merges on an arbitrary column, then grouping and aggregation over the left table's items. This is partially doable without special functions, but it could be a killer feature if someone built it for spatial joins.
Very often you need to do the following:
buffer gdf1.geometry by 500 m and spatially join it with gdf2.geometry, then look up gdf2's fields and bring them over to gdf1, while keeping the original gdf1.geometry (points). This operation takes a dozen lines and leaves lots of variable garbage if not put into a separate function. But IMO it could be condensed into something more natively supported.
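For comparison, the PostGIS equivalent condenses the buffer-then-spatial-join into a single predicate (real functions, hypothetical tables; assumes a metric projected SRID):

    SELECT p.*, a.some_field
    FROM points p
    JOIN areas a
      ON ST_DWithin(p.geom, a.geom, 500);  -- within 500 m, no explicit buffer or temp geometry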
Something like.. BQL http://intelligiblebabble.com/a-better-query-language-bql-la... which went nowhere https://github.com/lelandrichardson/BQL
My approach has been to design a very simple (in the lisp sense) syntax, kind of the opposite to SQL where everything is hard-coded into the parser. I've adopted a "pipeline programming"-like approach, where the operations are just (special) functions, which also helps with extensibility. Have you thought about this? From a cursory look, it seems Preql does rely on keywords. Admittedly fewer than SQL, but it also doesn't cover all of its features.
Having said that, the number of keywords and operators that you see in Preql right now isn't likely to grow by much. I have the basic and common operators covered, and for the most part the rest can be done with regular functions.
I agree about introspection, which is why in Preql you can ask for the type of any value, list the columns of a table, get the resulting SQL of an expression as a string, and so on. And certainly more can and should be done towards that.
I think many of the comments here are missing the point by saying "oh, you can get around that issue in that example snippet by doing X Y Z". Sure, there are workarounds for everything if you know the One Weird Trick with these 10 gotchas that I won't tell you about... but that just makes the author's point.
We can do better. We deserve better.
What could things look like if you could radically alter the SQL language, replace it altogether, or even move layers from databases or applications into each other?
Who knows if it will be better or worse, but I'd like to find out.