While Postgres keywords are indeed case insensitive, writing queries with keywords in all caps is usually an effort to increase legibility for visual pattern matching. It absolutely isn't needed, but if I'm debugging a query of yours I will send it through my prettifier so that I can breeze through your definitions without getting hung up on minor weird syntax things.
It's like prettification in any other language - visual structures that we can quickly recognize (like consistent indentation levels) make us waste less time on comprehension of the obvious so we can focus on what's important.
The only thing I really object to is "actuallyUsingCaseInIdentifiers": I never want to see columns that require double quotes for me to inspect on the CLI.
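For anyone who hasn't been bitten by this, a tiny sketch (hypothetical table and column names) of why mixed-case identifiers are painful: once something is created with quotes, every later reference has to be quoted too.

```sql
create table users ("userId" integer);

select userid  from users;   -- ERROR: column "userid" does not exist
select "userId" from users;  -- works, but now you are quoting it forever
```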
I find all-caps identifiers wind up just looking like interchangeable blocks, whereas lowercase ones have word shapes. So all caps just slows down my reading.
I feel similarly, and I also have friends with dyslexia who have even stronger opinions on it. All caps, in addition to being "shouting" to my ancient internet-using brain (and thus rude in most cases), creates big, similar rectangular blocks as word shapes and is a big speed bump to reading speed for everyone, whether or not they notice it. For some of my friends with dyslexia, who have a tough time with word shapes at the best of times, all caps can be a hard-stop "cannot read" blocker. They say it is like trying to read a redacted document where someone has made rectangular black marker cross-outs.
Personally, given SQL's intended similarity to English, I find that I like English "sentence case" for it: the opening keyword starts with a capital letter and nearly every remaining letter is lower case (except for Proper Nouns, the truly case-sensitive parts of SQL like table names). Sentence case has helped me in the past to spot missing semicolons in dialects like Postgres' that require them, especially near keywords like `Merge`, and to visually confirm that the `select` under a `Merge` is intended as a clause rather than the start of a new "sentence".
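A rough sketch of what that sentence-case style might look like, with made-up table names (Postgres 15+ for `Merge`); the point is that the capital letters mark where each "sentence" begins:

```sql
Merge into accounts a
using staged_balances s on s.account_id = a.id
when matched then update set balance = s.balance
when not matched then insert (id, balance) values (s.account_id, s.balance);

Select count(*) from accounts;  -- the capital S and the semicolon above mark a new "sentence"
```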
I prefer lower case for my personal legibility reasons, and it seems like a prettifier should be able to adjust to the user's preference. It's not a team sport for me, so I've never had a conflict of styles other than converting public code samples to match my way.
I've always found it funny that SQL was designed the way it is to be as close to natural English as possible, but then they went ahead and made everything all-caps
My understanding is that the caps were syntax highlighting on monochrome screens; no longer needed with colour. Can't provide a reference, it's an old memory.
Most of the SQL I write is within a string of another programming language, so it's essentially monochrome unless there's some really fancy syntax highlighting going on.
I think the "color is all we need" idea makes sense in proportion to how many of our tools actually support colorization.
E.g., the last time I used the psql program, I don't think it had colorization of the SQL, despite running in a color-capable terminal emulator.
It probably doesn't help that terminal colors are a bit of a mess. E.g., piping colored output through 'less' can result in some messy control-character rendering rather than having the desired effect.
Like sigils in that regard. Perl-type sigils are extremely nice... if you're editing in Notepad or some ancient vi without syntax highlighting and the ability to ID references on request. Little point to them, if you've got more-capable tools.
I'm curious about this for DuckDB [1]. In the last couple months or so I've been using DuckDB as a one-step solution to all problems I solve. In fact my development environment rarely requires anything other than Python and DuckDB (and some Rust if native code is necessary). DuckDB is an insanely fast and featureful analytic db. It'd be nice to have a linter, formatter etc specifically for DuckDB.
There is sqlfluff etc but I'm curious what people use.
[1] The DuckDB SQL dialect is very close to Postgres: it's compatible in many ways, has some extra QOL features related to analytics, and lacks a few features like `vacuum full`.
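A couple of the QOL extras I mean, with made-up tables; these are DuckDB additions, not Postgres syntax:

```sql
-- Everything except one wide column:
select * exclude (raw_payload)
from events
where event_date >= '2024-01-01';

-- Group by every non-aggregated column without repeating the list:
select team, status, count(*)
from tickets
group by all
order by all;
```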
Ditto - if I'm throwing out an inspection query just to get a sense of what kind of data is in a column I won't bother with proper readable syntax (i.e. a `select distinct status from widgets`). I only really care about code I'll need to reread.
> writing queries with keywords in all caps is an effort to increase legibility for visual pattern matching
Considering most programming languages do just fine without ALL CAPS KEYWORDS I'd say it's a strange effort. I wish SQL didn't insist on being different this way.
I agree with you on prettification though. As long as the repository chooses a prettifier, you can view it your way and then commit it their way. So that's my advice: always demand prettification for pull requests.
I don't write or read SQL too often, but I prefer the ALL_CAPS because it usually lets me know if something is part of the SQL syntax itself or a function or whatever, or if it's referencing a table/column/etc.
Obviously not very foolproof, but automated linters/prettifiers like the one in DataGrip do a good job with this for every query I've ever thrown at it.
The all caps syntax also helps queries stand out as distinct from typical source languages. It is often helpful, since SQL tends to end up in all sorts of applications.
I think fighting in PRs over syntax preferences is pretty useless, so dev shops should generally have code style guidelines to help keep things consistent. In my company we use all-caps keywords since we have momentum in that direction, but I think that decision can be reasonable in either direction as long as it's consistent - it's like tabs vs. spaces. I've worked in companies with both preferences; I just configure my editor to auto-prettify code coming out and auto-lint code going in, and never worry about it.
I think the increased legibility for visual pattern matching also makes SQL easier to read for many of the 350 million color blind people in the world.
Why don't they deprecate some of these features? If they're such easy stumbling blocks, seems like it makes sense to disable things like table inheritance in new schemas, and require some kind of arcane setting to re-enable them.
e.g. the suggested replacement for timestamp is timestamptz, which has its own problems (notably, it eagerly converts to UTC, which means it cannot account for TZ rule changes between storing the date and reading it). If medium-term scheduling across multiple countries is something that needs to work in your app, you're kind of stuck with one column for the timestamp and another column for the timezone.
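A minimal sketch of that two-column approach (hypothetical table): store the wall-clock time and the IANA zone name separately, and only resolve to an absolute instant when reading, so later TZ rule changes are picked up.

```sql
create table appointments (
    id        bigint generated always as identity primary key,
    starts_at timestamp not null,  -- local wall-clock time, as the user picked it
    time_zone text      not null   -- IANA zone name, e.g. 'Europe/Berlin'
);

-- Resolve at read time; a rule change between write and read is handled correctly:
select id, starts_at at time zone time_zone as starts_at_instant
from appointments;
```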
A lot of these aren't postgres-specific. (null weirdness, index column order, etc.)
For example, how nulls work - especially how they interact with indexes and unique constraints - is also non-intuitive in MySQL.
If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.
> If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.
FWIW, since Postgres 15 you can influence that behaviour with NULLS [NOT] DISTINCT for constraints and unique indexes.
https://www.postgresql.org/docs/devel/sql-createtable.html#S...
EDIT: Added link
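A quick illustration of both behaviours (made-up table names; the second one needs Postgres 15+):

```sql
-- Default: NULLs compare as distinct, so both rows are accepted.
create table users (
    email    text not null,
    username text,
    unique (email, username)
);
insert into users values ('a@example.com', null);
insert into users values ('a@example.com', null);  -- no error

-- Postgres 15+: treat NULLs as equal for uniqueness purposes.
create table users2 (
    email    text not null,
    username text,
    unique nulls not distinct (email, username)
);
insert into users2 values ('a@example.com', null);
insert into users2 values ('a@example.com', null);  -- fails: duplicate key
```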
> Normalize your data unless you have a good reason not to
Ouch. You don't want to just say that and move on.
The author even linked to a page citing 10 different kinds of normalization (11 with the "non-normalized"). Most people don't even know what those are, and have no use for 7 of those. Do not send people on wild-goose chases after those higher normal forms.
The one exception I'll make from the very start is "add tenant identifier to every row, yes, even if it's linked to another table that has tenant identifiers."
Sure, this means you will have some "unnecessary" `tenant_id` columns in some tables that you could get through a relation, but it saves you from _having_ to include that relation just to limit by tenant, which you will some day almost be guaranteed to want. (Especially if you want to enable row-level security¹ someday.)
¹ - https://www.postgresql.org/docs/current/ddl-rowsecurity.html
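A sketch of the kind of schema this enables; the tables are hypothetical, and `app.current_tenant` is an assumed custom setting that the application would set per connection:

```sql
create table tenants (
    id bigint generated always as identity primary key
);

create table orders (
    id        bigint generated always as identity primary key,
    tenant_id bigint not null references tenants (id)
);

create table order_items (
    id        bigint generated always as identity primary key,
    order_id  bigint not null references orders (id),
    tenant_id bigint not null references tenants (id)  -- "redundant", but useful
);

-- Row-level security can then be expressed without any joins:
alter table order_items enable row level security;

create policy tenant_isolation on order_items
    using (tenant_id = current_setting('app.current_tenant')::bigint);
```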
It's usually about normalizing to third normal form. But that's not enough in some cases, too much in others, and the reason it breaks down is performance about as often as it isn't.
I didn't know this when I started, so I never vacuumed the reddit databases. Then one day I was forced to, and it took reddit down for almost a day while I waited for it to finish. And we did run out of transaction IDs, which is why I was forced to do it.
I never turned on the auto-vacuumer, but I did set up a daily vacuum.
Keep in mind I left Reddit 13 years ago and I'm sure they've made improvements since.
Auto vacuuming is enabled now. We did have some near misses due to long running vacuums that barely completed before wraparounds, but got things tuned over time.
I did none of that work but was on the team where it happened.
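If you want to see how close a cluster is to that situation, the standard catalogs expose it; a generic query, nothing reddit-specific:

```sql
-- Age of the oldest unfrozen transaction ID per database. If this climbs
-- toward ~2 billion, tuples must be frozen (by autovacuum or a manual
-- vacuum) before Postgres stops accepting writes.
select datname, age(datfrozenxid) as xid_age
from pg_database
order by xid_age desc;
```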
Long before databases could even store structured JSON data, junior developers used to bikeshed viciously over the correct degree of normalization.
More experienced developers knew that the correct answer was to duplicate nothing (except for keys obviously) and then to denormalize only with extreme reluctance.
Then databases like mongo came along and encouraged those juniors by giving them something like a database, but where normalization was difficult/irrelevant. The result was a brief flowering of horrible database designs and unmaintainable crap towers.
Now the pendulum has swung back and people have rediscovered the virtues of a normalized database, but JSON columns provide an escape hatch where those bad practices can still flower.
Eh. JSON has its place. I have some stateful data that is fairly transient in nature and which doesn’t really matter all that much if it gets lost / corrupted. It’s the sort of thing I’d throw into Redis if we had Redis in our stack. But the only storage in my project is S3 and Postgres. Postgres allows me to trivially query, update, analyze the usage of the feature, etc. Normalization wouldn’t buy me much, if anything, for my use case, but it would make “save this stuff for later” more of a nuisance (a sync across a bunch of rows vs a single upsert).
That said, I’ve worked on projects that had almost no normalization, and it was pure hell. I’m certainly not arguing against normalizing; just saying that data blobs are useful sometimes.
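For what it's worth, the "single upsert" shape I mean looks roughly like this (hypothetical table, one blob of state per user):

```sql
create table ui_state (
    user_id    bigint primary key,
    state      jsonb not null,
    updated_at timestamptz not null default now()
);

-- "Save this stuff for later" is one statement, not a multi-row sync:
insert into ui_state (user_id, state)
values (123, '{"panel": "open", "filters": ["draft", "sent"]}')
on conflict (user_id)
do update set state = excluded.state, updated_at = now();
```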
1. To store JSON. There's a pattern where, when your webserver calls into some third-party API, you store the raw API response in a JSONB column and then process the response from there. This gives you an auditable paper trail if you need to debug issues coming from that 3rd-party API.
2. To store sum types. SQL not supporting sum types is arguably the biggest deficiency when modelling data in SQL databases. There are several workarounds - one of them being "just chuck it in a JSONB column and validate it in the application" - but none of them is particularly great.
3. End-user extra fields. Stuff you don't care about, but someone somewhere does.
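A minimal sketch of pattern 1 above, with made-up table and field names:

```sql
create table api_responses (
    id          bigint generated always as identity primary key,
    provider    text  not null,
    received_at timestamptz not null default now(),
    payload     jsonb not null  -- the raw response, stored untouched
);

-- When something goes wrong downstream, the paper trail is one query away:
select received_at,
       payload->>'status' as status,
       payload->'error'->>'message' as error_message
from api_responses
where provider = 'billing'
  and payload->>'status' <> 'ok'
order by received_at desc;
```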
Even if you care about it, you will still often wind up with a junk drawer of JSONB. I don't really see it as a problem unless people are writing bad queries against it instead of lifting values out of the JSONB into their own columns, etc.
Most developers using these kinds of tools these days are actually building their own database management systems, just outsourcing the persistence to another DBMS, so there isn't a strong imperative to think about good design so long as it successfully satisfies the persistence need.
Whether we actually should be building DBMSes on top of DBMSes is questionable, but that is the current state of affairs regardless.
A previous employer thought that SQL databases didn't understand graphs, so they made their own system for serializing/deserializing graphs of objects into Postgres. They never used queries and instead had their own in-memory operators for traversing the graph, and had to solve problems like deleting an entry and removing all references to it, and partial graph updates, themselves.
And I still don't think it works.
This needs a working schema migration process, including the ability to undo a schema change if the new column tanks performance or breaks stuff.
If there are CLI tools involved, you also need to ensure you can handle some downtime, or do a synchronized version update across the company, or support both old and new schemas for a while.
If the database is not part of the team's primary product, all of this could be missing.
this is really nice. i am glad the author put it together. i didn't know the pg docs were 3200 pages long! i have been using it for a while and try to learn as i go. i really do like the docs. and i also like to read articles on various particular subjects as i find a need to.
i think the author might find it helpful to readers to add a note to https://challahscript.com/what_i_wish_someone_told_me_about_... that if someone is selecting for b alone, then an index on columns (b, a) would work fine. i think this is kind of implied when they talk about selecting on a alone, but maybe it wouldn't hurt to be extra explicit.
(i didn't spend much time on the json/jsonb part since i rarely use that stuff)
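to make the index column order point concrete (throwaway table and index names):

```sql
create table t (a int, b int);
create index t_b_a_idx on t (b, a);

-- a filter on b alone can be served by the (b, a) index, since b is the
-- leading column (given a realistically sized, analyzed table):
select * from t where b = 42;

-- a filter on a alone generally can't use it efficiently; it needs its own
-- index, or one that starts with a:
select * from t where a = 42;
```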
Specifically, if I'm banging out an ad-hoc query that no one will ever see and that I'm going to throw away, I don't worry about casing.
Otherwise, for me, all SQL that's checked in gets the commands in ALL CAPS.
This is mostly irrelevant since you really shouldn't be mixing quoted with unquoted identifiers, and introspection largely isn't standardized.
Usual caveat of how nobody sticks to the ANSI standard anyway applies.
If you don't:
- https://www.depesz.com/2022/09/21/prettify-sql-queries-from-...
- https://gitlab.com/depesz/pg-sql-prettyprinter
Or https://paste.depesz.com for one-off use.
I thought he was talking about psql's case sensitivity with table names, which is incredibly aggravating.
The combination of all-caps keywords + following "the river" whitespace pattern dramatically improves readability, in my opinion.
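For anyone who hasn't seen it, "the river" right-aligns the keywords so a channel of whitespace runs down the query; roughly like this, with made-up tables:

```sql
SELECT u.id,
       u.email,
       count(o.id) AS open_orders
  FROM users u
  JOIN orders o
    ON o.user_id = u.id
 WHERE o.status = 'open'
 GROUP BY u.id, u.email
 ORDER BY open_orders DESC;
```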
All caps letters are more similar and harder to read.
But getting some editor that highlights the SQL will completely solve your issue.
[0] https://wiki.postgresql.org/wiki/Don%27t_Do_This
Made me reconsider a few habits I picked up from MySQL land
And they're right! I've had to fix a few issues with this in a project I recently got moved to. There's almost never a reason to duplicate data.
I'm fine with using it for simple throwaway stuff, but deciphering someone else's ball of JSON is soul-killing.
Whether we actually should be building DMBSes on top of DMBSes is questionable, but is the current state of affairs regardless.
. They never used queries and instead had their own in-memory operators for traversing the graph, had to solve problems like deleting an entry and removing all references, partial graph updates.
And I still don’t think it works.