SQL language proposal: JOIN FOREIGN

This is not actually compatible with SQL semantics.

An important constraint on SQL is that a query must run, and produce correct results, relying only on the structure and content of tables. Indexes (can) make queries faster but must not inhibit, or be required for, correctness. The same is true of primary key constraints, foreign key constraints, check constraints, defaults, triggers, partitioning, whether a table is heap/clustered, and literally every other implementation detail of the RDBMS.

This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.

More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.

So no, let's not do this.

JoelJacobson · 4 years ago

> This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.

Actually, constraint names already do appear in some DQL statements, such as the quite recently added INSERT INTO ... ON CONFLICT in PostgreSQL [1]

    INSERT INTO ... ON CONFLICT [ conflict_target ] conflict_action

where conflict_target can be one of:

    ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
    ON CONSTRAINT constraint_name

[1] https://www.postgresql.org/docs/current/sql-insert.html

bradleyjg · 4 years ago

PostgresSQL isn’t SQL though. Vendor extensions have done all kinds of inelegant or even outright ill advised things from the beginning.

cryptonector · 4 years ago

That's DML, not DQL. Still, I disagree with GP's arguments (see separate comment).

Deleted Comment

cryptonector · 4 years ago

> and produce correct results

This doesn't interfere with that.

> relying only on the structure and content of tables

Constraints are part of the table schema, not index schema.

> his proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement (as distinct from DDL statements where such names can already appear). That's not allowed.

Not allowed says who? The standard? I doubt it, and anyways, it can be changed.

"That's not allowed" is not a good argument. A better argument is that this is the first time a constraint can change the meaning of a query -- that is a good argument, but it would be better if the constraint could change existing queries, which it does not do. Because this would only affect queries that refer to the constraint, this seems quite allowable to me.

> More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, ...

This is not about optimization, ergo this argument is out.

I'm not sure I want this particular extension, but I don't buy your arguments against it.

goto11 · 4 years ago

> More importantly, it's _not a good idea_, because that semantic separation between data on one side, and the machinery of acceleration and validation on the other side, is critical to the value prop of the relational model and a big reason why it's been so hyper-successful.

Constraints are part of the logical model just like the types of columns are. Indexes on the other hand is part of the physical model and should be transparent to the logical model. Database engines tend to couple foreign-key constraints with indexes, since you usually want an index on a foreign-key. But in principle they are separate.

So I don't see any violation of the relational model in this proposal. I do like the general idea of extending SQL with metadata-aware abstractions, although I'm not a fan of this particular syntax.

dragonwriter · 4 years ago

> This proposal would break that constraint, by elevating the _name_ of a foreign key constraint to an identifier usable in a DQL statement

This doesn't break that constraint in an SQL RDBMS that also implements the relational model, since it is a fundamental element of the relational model that schema metadata is stored as data, and therefore constraint specifications, including names, are included within “content of tables”.

lolive · 4 years ago

To me, making the data relationships implicit (and making them explicit again in the DAO layer of your app) is a nightmare from a data design point of view. In real life, a DB is often tied to the app that was built on top of it. Making data reuse, data discovery and data understanding VERY hard.

Deleted Comment

jpz · 4 years ago

I agree with all the points you make here, we’ll said.

What I really need is something like:

SELECT -col1, -col14 FROM table LIMIT 50;

Where the minus sign means I don't want these two columns. I still don't see a way to do it easily (for Vertica and in Datagrip).

nvartolomei · 4 years ago

Some have support for this.

- BQ https://cloud.google.com/bigquery/docs/reference/standard-sq...

- CH https://clickhouse.com/docs/en/sql-reference/statements/sele...

hnthrowaway0315 · 4 years ago

Thanks, these really look neat. Unfortunately we only use Vertica :/

piaste · 4 years ago

Similarly, I'd love some form of

    GROUP BY every column except for <these>

It feels silly when you are SELECTing a ton of columns, then you add a JOIN to a many-to-one relationship which you want to aggregate. Now you need to either make it a subquery (and hope the optimizer doesn't screw up) or duplicate all your SELECT expression (not even the identifiers) into the GROUP BY.

baskethead · 4 years ago

Queries like this won’t break but will silently fail upstream by missing data if column names change or are deleted. The explicit nature of SQL ensures that it will break positively which is a better type of failure in my opinion.

icambron · 4 years ago

I wish it just had a "group by all the stuff I selected without aggregation" shortcut

Svip · 4 years ago

You know, you can just write GROUP BY x, y, ...

    SELECT t.i+1, count(*) FROM table t GROUP BY 1

1 in this context means the first selected item (i.e. t.i+1). I know this works in PostgreSQL.

dragonwriter · 4 years ago

What I really want is a special phrase MINIMAL GROUPING such that GROUP BY MINIMAL GROUPING includes exactly the items that it would be an error not to include in the GROUP BY.

oweiler · 4 years ago

Can't you use an alias in a group by?

Deleted Comment

blondin · 4 years ago

i would rather it written as SELECT * BUT or SELECT * EXCEPT or even SELECT ALL BUT.

SQL has always been that language that is easy to read. even when you don't understand what the queries are doing. adding a cryptic syntax like "-column" would make it less readable.

polygotdomain · 4 years ago

What about SELECT * WITHOUT {columns_you_do_not_want} FROM...

EXCEPT is already a keyword and has is used for set-based operations, so I don't think it's good to overload it.

recursive · 4 years ago

The given syntax is obviously a no-go. It's already unary minus.

colinmhayes · 4 years ago

EXCEPT is already a sql keyword

bobbyi · 4 years ago

If you have specific columns that you frequently want to ignore (which, I think, is the common case of this), you could define a view that selects all the columns except those and do your queries against that view.

Eremotherium · 4 years ago

Or create a function that returns a view of a table minus named columns in situ if that's really a common case for you. Just as an idea.

dolmen · 4 years ago

> SELECT -col1, -col14 FROM table LIMIT 50;

This is already valid SQL. Example: SELECT -col1, -col4 FROM (SELECT 1 AS col1, 2 AS col4) AS tbl;

Do you think seriously that a new meaning could ever be attached to that syntax?

hnthrowaway0315 · 4 years ago

I just say something similar, doesn't have to be the exact form.

dspillett · 4 years ago

There are often occupancies where I'd like this in ad-hoc queries, but it would be something to avoid and discourage in production code for the same reasons “SELECT *” is discouraged (because it is “SELECT *” with a bit of sugar added)

stingraycharles · 4 years ago

Simple, effective, and really useful in a lot of situations. I like it!

jdunstan · 4 years ago

That's very close to SELECT *, which has it's own dangers. I agree that it would be nice for exploration and testing, but probably should not be used in production.

hnthrowaway0315 · 4 years ago

The problem I found out is that SELECT * LIMIT 10 is guaranteed to be needed so it's actually much more important than whatever is in production (you have to explore a lot before writing the production scripts). My idea is that exploration should be as easy as possible.

Things such as SELECT * EXCEPT col1, col2 are really a PIA to write and can build up frustration level really quickly. Certain IDEs such as Datagrip ease the process by providing "macros" but they are not enough.

Another thing is to generate useful boilerplates such as SELECT col1 FROM table GROUP BY col1 ORDER BY col1 to explore all unique values of col1.

ipaddr · 4 years ago

What dangers lay in select *...? Too much data?

c06n · 4 years ago

R's data.table has that.

dt[1:50, -c('col1', 'col14')]

slotrans · 4 years ago

Smotko · 4 years ago

It might be just me, but I feel like remembering the foreign key name is more difficult than remembering the columns that you need in the ON clause. Especially since you can usually find the column names by just seeing the data in the table (select * from x) wheres seeing the foreign key names is much harder (show create table x?).

Also, if you use an ORM it will usually generate foreign key names that are almost impossible to remember.

rtpg · 4 years ago

In a universe where foreign key index names are important we would specify better names.

I think stuff like “documents_by_user” as foreign key names and explicit index usage would improve peoples awareness of how indices get used and would generally be a positive

exabrial · 4 years ago

I think this is an operator problem. You're using the wrong tool for the job.

TablePlus, SequelAce, the official MySQL client all support cntrl-space autocompletion. I wish we used Postgres, but I imagine the landscape is the same. The big box databases like Oracle, DB2 undoubtedly having this tooling as well.

That being said, here is our fk naming convention: `fk-asset_types->asset_categories` which pretty states what's going on and is easy to remember.

SQL is not only written in an SQL client. SQL is also written (and read from) embedded/mixed in an other programming language were tooling is not always available.

Having to know the names of foreign keys (in addition to the column names of the 2 tables) is adding more cognitive load. I don't think that is an improvement.

It would indeed be difficult to remember, but the proposal also suggest changing the default naming convention for foreign keys, to give them the same name as the referenced table.

If using an ORM, I would guess this proposal isn't useful, since then you wouldn't hand-write queries anyway, right? Except when you want to override the queries generated by the ORM? (I'm not an ORM user myself.)

to11mtm · 4 years ago

Speaking as someone who has used ORMs in the past and contributes to a LINQ Micro ORM...

It might make tooling 'easier', but since backwards compatibility has to be considered the actual value add is questionable IMO.

Most ORMs/MicroORMs will have tooling that sniffs out the DB Schema including foreign keys, and if you are using those bits (i.e. 'not hand written') most will do the right thing today. I suppose you could include some extra syntax for whatever DSL you're providing users....

IDK. Speaking as someone who is very comfortable in SQL, This feels more like syntactic sugar than anything else.

This is the same idea as already posted in this thread https://news.ycombinator.com/item?id=29687134, but with a more in-depth explanation that couldn't fit in the comment field, and with some syntax improvements, such as using "FOREIGN", which is a reserved keyword, instead of the previously suggested "->" notation.

Thanks for all the valuable comments on last proposal. Excited to hear what you think about this update.

thom · 4 years ago

Of all the myriad indignities of SQL, this isn't near the top of my list. I also don't like making the names of objects like foreign keys and indexes first class concerns in your queries, that's a whole new layer of cognitive overhead.

degenerate · 4 years ago

Agree. I'd much prefer they worked on the way MySQL determines what indexes to use on a multi-table join, so there could be less emphasis on the need to ANALYZE TABLE and FORCE INDEX when the DB is seemingly being dumb, but there's not enough information in EXPLAIN to tell you where it's actually getting tripped up. Troubleshooting full table scans on large sets of data is a nightmare.

> but there's not enough information in EXPLAIN to tell you where it's actually getting tripped up

Unfortunately I don't think providing that information is generally possible. In some specific cases there are useful details it could provide, but there would usually be a myriad of other similar details that are irrelevant and if it included all those you'd not see the wood for the trees.

EXPLAIN and query plan outputs in other DBs are “this is what I did” not really “why I did what I did”. To make the query planner bright enough to know what details would be useful to you, would probably pretty much require making it bright enough to do the optimisation job without you¹.

[1] picking better indexes without hints, even creating those that are often needed, etc.

MS's SQL Server tries to do this a bit with index suggestions. These are sometimes handy, but often at best for guidance². I've seen people blindly follow these suggestions to get a %-or-few gain from a small set of queries that could see orders of magnitude improvement with just a little tweaking elsewhere³, slowly amassing collections of indexes for very specific cases, sometimes multiple on the same key columns but each INCLUDEing a different mix of other data, that balloon their storage requirements⁴.

[2] a nudge in the direction of “Mr Dev/DBA, you might want to think about how I'd avoid scanning this large object or performing many many thousands of seeks on this other one”

[3] refactoring non-sargable predicates, index changes on other tables being referred to, getting rid of “SELECT *” particularly when referring to hideous views, ...

[4] and having the knock-on effects of slowing insert/update activity & important admin functions (particularly backups)

Tostino · 4 years ago

Slightly different group of people working on MySQL vs this proposal.

Upitor · 4 years ago

In my opinion this proposal seems only to consider simple cases, but there are many not-so-simple relationsship types:

Consider a ‘sales’ table which includes columns [time] and [sold_by_employee_id], and a periodized ‘employee’ table which includes columns [employee_id], [valid_from] and [valid_to] columns. There is a perfectly valid relationsship between the two tables, but you cant join them using only equal-statements (you need a between-statement as well)

Nice example! The join you describe would remain as a JOIN ON.

This is per design. Quote from the proposal:

"The idea is to improve the SQL language, specifically the join syntax, for the special but common case when joining on foreign key columns." ... "If the common simple joins (when joining on foreign key columns) would be written in a different syntax, the remaining joins would visually stand out and we could focus on making sure we understand them when reading a large SQL query."

So, the special non-equal based join condition you describe, would become more visible, and stand out, allowing readers to pay more attention to it.

The hypothesis is most joins are made on foreign key columns, so if we can improve such cases, a lot can be won.

DevKoala · 4 years ago

But how could you accurately tell if some queries join on the foreign key, but were written by someone without knowledge of the new specification?

I see. I should have read the article more carefully :-)

Also, consider a ‘sales’ table with multiple references to a ‘calendar’ table: [shipped_date], [order_date], [received_date]

Good example too, but this one can with benefit be written using the JOIN FOREIGN syntax, you just need to give the foreign keys suitable names such as e.g. "shipped_date", "order_date", "received_date". Or, to remind you of which is the referenced table, perhaps you want to include it in the names, and the names would be "shipped_date_calendar", "order_date_calendar", "received_date_calendar", but personally I would prefer just "shipped_date" or perhaps even just "shipped".

CRConrad · 4 years ago

Are those timestamps, or actual dates, i.e. meaning a whole day encompassing (usually) 24 hours?

In many (most?) data warehousing projects I've seen, you make a "fake date" integer the primary key column of your Dates dimension. This integer consists of 10000 × YEAR_PART + 100 × MONTH_PART + 1 × DAY_PART of the date in question, so yesterday's New Year's Eve woul get 10000 × 2021 + 100 × 12 + 1 × 31 = 20211231. The date dimension itself has many more columns (often booleans, IS_WEEKEND, IS_HOLIDAY, etc; also the date parts themselves in both numeric and character form (12, 'December'), day of week (5 [or 6, depending on convention], 'Friday'), etc) that are used for BI and reporting.

But since this generated ISO-8601-date-as-integer column is the primary key of the Dates table, it is also the value of the Date foreign key in all tables that reference Dates. That makes it incredibly handy in queries -- both during development and for ad-hoc reports -- of those tables without joining to the Dates dimension at all: grouping, sorting, limiting to a more manageable date interval in the WHERE clause... And it tells the reader exactly what the actual date in question is. (Well, at least readers who are used to ISO-8601-format dates.) As jerryp would have said, recommended.

tester756 · 4 years ago

While we are at it

let's write SQL queries starting from FROM.

`FROM users SELECT *`

It'd allow tooling to provide IntelliSense better.

miohtama · 4 years ago

While this might be a joke, the world has been ripe for a replacement for SQL… for the last 20 years.

hdjrudni · 4 years ago

I started writing a new SQL parser that just transforms the query...this was one of the features. The other being a syntax for group-wise max queries which are just crazy dumb to write efficiently in MySQL.

robocat · 4 years ago

I could see some standardised SQL language extension, similar to how TypeScript extended JavaScript, that has a transformation phase into the native SQL dialect within the database access library, or via some middleware.

vosper · 4 years ago

I agree, though I'll give SQL a pass for this because it's old.

But how the Javascript world ever thought that `import { function } from 'library'` was better than `from 'library' import { function }` I'll never know. Python got this right long before anyone was even thinking about adding imports to JS!

goscript · 4 years ago

I agree but not with current JavaScript modules. i would rather work with this:

    import { functionA } from 'library';
    import { functionB } from '../utils/core/abc';
    import { functionC } from './a';

over:

    from 'library' import { functionA };
    from '../utils/core/abc' import { functionB };
    from './a' import { functionC };

Ya..that one is particularly sad because it was recent. Don't know why they felt the need to botch that.

zoomablemind · 4 years ago

I wonder if primary objective of this proposal is to increase convenience and reduce the amount of typing that users need to do, when using SQL inline?

If indeed that's the goal, then it targets a rather specific subset of users dealing with explorative/ad-hoc analysis on a database. Once such analysis is done, the queries would usually need to be formalized for robustness and to avoid ambiguities.

Obviously, the whole train of queries would derail, should the FK (which is just an index) be dropped for one reason or the other.

The existing JOIN features are explicit at least on the level of specified table structure. I believe, any constraint details in such context will be, well, ...foreign.

Perhaps, a simple solution to verbosity problem may be to use an "intelligent" SQL client, which supports some form of autocomplete and which may as well internally use as many schema/data details as available.

In anycase, thanks for making the proposal. I was not aware of JOIN ... USING syntax. I often wanted some convenient way of specifying homonymous join columns, as some schemes are consistent in such namings. So typing JOIN on col1, col2... would translate into equality joins between the listed tables. However, again, there is ambiguity here...

branko_d · 4 years ago

> FK (which is just an index)

Correction: foreign key is not an index. Many (most?) DBMSes allow a FK to exist without an index.