I think the typical presentation of 4NF makes more sense when you consider the ergonomics of data processing systems at the time relational modeling emerged. Predating computing, data processing was all centered around "records." Relational databases are often also described in terms of "records"; relational modeling emerged out of an attempt to formalize data processing practices, and the practitioners of the era were learning relational modeling from a data processing background, so it's no surprise that relational concepts are often presented "as opposed to" the way things were done in data processing.
In traditional data processing systems, a "record" was far more independent than we think of in a relational database. Records were hand-punched onto cards or keyed onto magnetic tape to be sorted, summarized, or whatever operation you cared about by a data processing machine. In this environment, joins were extremely expensive operations, often requiring that the operator feed the card stack over and over again (a rather literal O(n^2)). So, the "weird composed" schema is absolutely what you would do. And a lot of computer software was built around the exact same concepts, which made sense anyway as computers often continued to use sequential-access storage devices with similar (but less severe) problems around joins. This era famously persisted for a long time, with relational modeling as an academic concept well predating successful implementations of relational databases.
One could argue that all of the normal forms are pretty much "stop doing it the old sequential access (card-stack) way and use random access concepts (keys and joins) instead."
Of course that leaves the question of whether or not we should teach it that way... we don't tend to tell students about how memory is now random access, so perhaps by the same token the historical approach isn't useful for teaching here. But it would undoubtedly be more effective if you gave a little of the history.
I wonder how many people have actually seen those index card shelves[1] IRL. Everyone has heard the music box analogy, the Tacoma Narrows, etc., but I don't remember this technology being covered often enough.
One set of shelves has cards for books sorted by ISBN. Another by title, by author, by keyword... Libraries (of books, of employee records, of manufacturing blueprints...) would have had whole rooms' worth of these shelves for searching and indexing.
Data normalization, SELECT, and JOIN are just separating information on cards into separate shelves, hand-picking cards, and laying out matching ones side by side on photocopiers. I've never had to mess with it, but apparently that was literally the database.
1: https://en.wikipedia.org/wiki/Library_catalog
Records had to be batched into blocks and blocks had to be organized for fast access on a spinning drum or platter. This is why we have B+ trees for databases, and not simple binary trees or whatnot. You need the branching factor in order to reduce the seek times: each tree node traversal goes to a different block of records. If they are on the same track, it takes rotational latency to get to the block. If they are on a different track, seek time plus rotational latency.
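A quick back-of-envelope illustration of why the branching factor matters (the numbers are illustrative, not from the comment above):

    10 million keys, branching factor 2 (binary tree):   ~log2(10^7)   ≈ 23 node visits, i.e. up to ~23 seeks
    10 million keys, branching factor 500 (B+ tree):     ~log500(10^7) ≈ 2.6, i.e. 3 block reads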
Reducing rotational latency is why we got 7200 RPM and faster drives: going from 3600 to 7200 means any block is available twice as often at the drive head. It doesn't matter how many or how few have been crammed onto the track; density doesn't fix latency.
What might be needless is the "4" in "4NF," because the first three (and a half) are obsolete. Or rather, as you mention, were only relevant when things were read sequentially. "Normalization" is really fine on its own.
Things that aren't normalized are "ad hoc," or really just "input."
> Things that aren't normalized are "ad hoc," or really just "input."
I agree that every denormalization is "ad hoc". It may not even feel this way, because sometimes such ad hoc choices become intimately tied into the system's performance architecture.
However, I'm not sure I agree with "just input", or maybe that's a different way of saying it: sometimes (often) you denormalize from 3NF to avoid a join. Often you can demonstrate that avoiding the join is beneficial by looking at the query plan.
But the problem with teaching is that people become generally more afraid of joins, somehow thinking that every join kills performance. ChatGPT produces a lot of FUD around joins and performance, unfortunately...
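For what it's worth, a minimal sketch of how you might check that in Postgres (the table and column names are hypothetical, just for illustration):

    -- Denormalized table that already carries the instructor's name, so no join is needed:
    EXPLAIN ANALYZE
    SELECT instructor_name, skill_id
    FROM instructor_skills_denorm
    WHERE instructor_id = 42;

    -- Normalized equivalent; compare the two plans and timings:
    EXPLAIN ANALYZE
    SELECT i.name, s.skill_id
    FROM instructors i
    JOIN instructor_skills s ON s.instructor_id = i.id
    WHERE i.id = 42;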
Thank you, the second paragraph of your comment is illuminating. There are some more replies in the same vein. I did not understand the mutual independence of "cards", and how you can use a different data layout if you only need to put it on a single card.
> with relational modeling as an academic concept well predating successful implementations of relational databases.
This is also non-obvious, thank you!
I think that a textual demonstration of the same effect can be found in https://www.cargocultcode.com/normalization-is-not-a-process... (the "So where did it begin?" section). I rediscovered this blog post after I published the article.
I’ve seen this design again and again, but not in relational databases. I usually see it in spreadsheets. I would be completely unsurprised to see a three column Excel sheet with Employee, Skill, and Language headers. (https://kerricklong.com/tmp/demo-4nf-need.jpg) The spreadsheet designer’s intuition is that Skill and Language are lists, and Employee is a section header. Smith might have a few rows with his languages and skills listed, one per cell, in arbitrary top-to-bottom order and with no consideration for how the language and skill in a row relate to each other—only to Smith. The Employee column has a single merged cell for Smith, as many rows tall as needed to contain the larger of his lists. When migrating this from a spreadsheet to a database, you lose merged cells as a feature and can no longer rely on order—so you copy the value Smith into each cell that had been merged.
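To make that concrete in SQL, here is roughly what the naive migration produces versus the 4NF decomposition (a sketch; the table and column names are mine, not from the spreadsheet):

    -- Direct translation of the spreadsheet: skill and language share a row
    -- even though each relates only to the employee, not to each other.
    CREATE TABLE employee_skill_language (
        employee TEXT,
        skill    TEXT,
        language TEXT
    );

    -- 4NF decomposition: two independent many-to-many facts, stored separately.
    CREATE TABLE employee_skills (
        employee TEXT,
        skill    TEXT,
        PRIMARY KEY (employee, skill)
    );
    CREATE TABLE employee_languages (
        employee TEXT,
        language TEXT,
        PRIMARY KEY (employee, language)
    );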
This is a very good graphical demonstration (especially the merged cells), thank you.
It is also a variation "2b" from Kent's list (just for reference).
Perhaps TFA could explain what happened to math between 1977 and 2024. Because the meaning of "atomic" didn’t change, nor did the meaning of 1NF. The author pretends to explain 4NF, but never actually begins. He calls himself an "historian", which he might be. But he betrays very little knowledge of the relational model.
Repeating fields, which he proclaims as some kind of innovation, violate 1NF. They’re not atomic. Maybe the easiest way to understand it is that the list of languages can’t be used in SQL. If one teacher speaks Italian and French, SQL won’t find WHERE LANGUAGE = ‘FRENCH’. Nor will SQL join two teachers on that column.
SQL doesn’t recognize repeating fields in a column because it already has a way to represent them: as rows. That a teacher speaks two languages is not one fact, but two; not surprisingly, those two facts appear in two rows. ‘Twas true in 1977. ‘Tis true today. Thus ‘twill ever be.
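A minimal sketch of that row-per-fact representation (the table and column names are assumptions, not from the comment):

    CREATE TABLE teacher_languages (
        teacher_id INTEGER,
        language   TEXT,
        PRIMARY KEY (teacher_id, language)
    );

    -- One teacher speaking Italian and French is two facts, hence two rows:
    INSERT INTO teacher_languages VALUES (1, 'ITALIAN'), (1, 'FRENCH');

    -- Plain SQL finds the French speakers directly:
    SELECT teacher_id FROM teacher_languages WHERE language = 'FRENCH';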
I mostly agree with you; however, one could argue that character strings aren’t atomic either. They are sequences (“arrays”, if you will) of characters, and query operators like LIKE demonstrate that strings aren’t just used as opaque values. You could in principle normalize character strings into a String table with columns “string ID”, “character position”, and “character value”. Somewhat similarly, types like DATETIME and INTERVAL can be regarded as non-atomic.
Whether a database column has a composite type is a trade-off regarding performance, storage requirements, and needing special operators for queries to work on the substructure of a composite value.
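For illustration, the reductio spelled out as a hypothetical schema (obviously not a recommendation):

    -- "Fully normalized" strings: one row per character.
    CREATE TABLE string_chars (
        string_id  INTEGER,
        position   INTEGER,
        char_value CHAR(1),
        PRIMARY KEY (string_id, position)
    );

    -- 'Ann' becomes three rows:
    INSERT INTO string_chars VALUES (1, 1, 'A'), (1, 2, 'n'), (1, 3, 'n');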
> Because the meaning of "atomic" didn’t change, nor did the meaning of 1NF.
So what is the meaning of "atomic"? Are array-typed values atomic? (I think you can treat it that way, a lot of lists are immutable in different programming languages.)
> But he betrays very little knowledge of the relational model.
This is absolutely possible. Let me ask a couple of questions to borrow from your understanding of relational model.
> Repeating fields, which he proclaims as some kind of innovation, violate 1NF. They’re not atomic.
Sorry, what is "repeating fields"? I don't remember ever using this, how does it look like? Are you talking about array-typed columns? If yes, then why are they not atomic?
Also, I don't think I proclaim absolutely anything is "kind of innovation"? Postgresql supports array-typed columns since 2001, and it was planned in 1986 already as part of other "post-relational" features.
Two schemas "(instructor_id, List<skill_id>); PK(instructor_id)", and "(instructor_id, skill_id); PK(instructor_id, skill_id)" are isomorphic, thus I may be wrong but I don't see any anomalies here.
> Maybe the easiest way to understand it is that the list of languages can’t be used in SQL. If one teacher speaks Italian and French, SQL won’t find WHERE LANGUAGE = ‘FRENCH’.
Ah, so you seem to be talking about array-typed columns. I mean, you can find it easily by using "WHERE languages @> ARRAY['FRENCH']". The difference is mostly syntactical.
If you think that you are not allowed to access single elements of "languages" array because it violates "atomicity" of the value, then we should also say that you're not allowed to use queries like "WHERE name LIKE 'A%'", because they violate the atomicity of a string value. Don't you agree?
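The two queries side by side, for reference (Postgres syntax; the instructors table and its columns are assumed):

    -- Element lookup inside an array-typed column:
    SELECT instructor_id FROM instructors WHERE languages @> ARRAY['FRENCH'];

    -- Substructure lookup inside an "atomic" string column:
    SELECT instructor_id FROM instructors WHERE name LIKE 'A%';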
> SQL doesn’t recognize repeating fields in a column
Array-typed columns appeared in ANSI SQL:1999, a quarter of a century ago. I'm not sure what you're talking about.
Atomic means nondecomposable [1]. I dare say array-typed values are not atomic.
Re: "(instructor_id, List<skill_id>); PK(instructor_id)”, are your Lists sorted ? Are your Lists actually Sets?: what prevents `insert values (100, [2, 2, 1])` ?
If you think those are isomorphic, how about “(json_data); PK(json_data)” ? With, you know, all the same data in the JSON. Tomato, tomato ?
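For what it's worth, here is what each shape can enforce declaratively, reusing the two tables sketched earlier in the thread (Postgres-flavored):

    -- Array variant: nothing in the schema rejects duplicates or imposes an order.
    INSERT INTO instructor_skills_arr VALUES (100, ARRAY[2, 2, 1]);   -- accepted

    -- Junction-table variant: the composite primary key rejects the duplicate fact.
    INSERT INTO instructor_skills VALUES (100, 2), (100, 2);          -- fails: duplicate key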
I’ll just link this [2] PDF that seems to describe the present.
> Does this mean that SQL:1999 allows databases that do not satisfy first normal form? Indeed, it does, in the sense that it allows “repeating groups”, which first normal form prohibits. (However, some have argued that SQL:1999’s ARRAY type merely allows storage of information that can be decomposed, much as the SUBSTRING function can decompose character strings—and therefore doesn’t truly violate the spirit of first normal form.)
[1] at least, according to Codd
[2] https://www.cl.cam.ac.uk/teaching/0304/Databases/sql1999.pdf
More than 20 years ago, when I started with databases, I found it natural to map a user object in my software to a single 'user' table with many columns (Excel style).
It's only when I discovered how things were implemented, and the required space and combinatorial complexity, that 4NF started to make sense.
I think the natural 'naive' approach is to have many columns with redundant information inside.
At some point, with more experience, we see how this system is harder to maintain (burden on the code logic) and how much space is wasted. I discovered 4NF (without giving it a name) by myself to solve those issues.
The problem is that we teach 4NF to people who have never implemented code against databases and had to solve the problems that arise if you only use multiple columns. So the examples seem artificial.
When I saw 4NF in my lectures for the first time, it was easy, as I already knew all the issues without it.
> I think the natural 'naive' approach is to have many columns with redundant information inside.
Interesting, thank you. I either don't remember that phase in my own learning, or I was lucky to read about the "correct" approach first. Maybe that's why I was so extra confused about 4NF explanations, haha.
I think it depends on what intuitions you start out with.
If you teach yourself programming as a kid, like I did, and get to know object-oriented programming very early on, normalization is more intuitive. "Of course the addresses go in a separate table from the users, just as you'd have an `address` object instead of just putting all the fields inside `User`."
If you start out using Excel a lot, and learn programming much later, the Excel way probably seems right at first. "Of course there's going to be a table with customer information, one row per customer, with separate fields for street, city, zipcode etc."
If you maintain data records manually in Excel sheets, you tend not to “normalize” the data into several tables, because having to chase foreign-key references means you quickly lose track and no longer see everything related to a single entity in one place. That’s because Excel doesn’t provide you with “joined” views. When inspecting a database interactively with a database GUI/TUI, this is very similar, unless you start building queries that let you join the data together again (which is harder and less intuitive than just browsing tables and filtering/searching in them).
Of course, not normalizing the data makes it more laborious and error-prone to keep things consistent, the more non-1:1 data associations there are. When databases are not presented as an interactive table-browsing tool, but as an application’s data store, where the application can join data as needed, then the benefits of having everything related to an entity “in one place” (in a single table) are outweighed by the drawbacks of non-normalized data. However, this is not obvious when you start out in the table-browsing paradigm. People naively tend to structure the data the way they would for manual organization.
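A tiny example of the kind of “joined view” the application asks the database for (hypothetical customers/orders tables):

    -- Everything related to one customer "in one place", reassembled on demand:
    SELECT c.name, c.city, o.order_date, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.id = 42;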
It was and is still common to have non-experts design databases. And in many cases, normal form doesn't make sense. Tables with hundreds or thousands of columns are commonly the best solution.
What is rarely stressed about NF is that update logic must exist someplace and if you don't express the update rules in a database schema, it must be in the application code. Subtle errors are more likely in that case.
In the 1990s I was a certified Novell networking instructor. Every month their database mysteriously delisted me. The problem was never found, but it would have been prevented if their db was in normal form. (Instead I was given the VP of education's direct phone number and he kept my records on his desk. As soon as I saw that I was delisted, I would call him and he had someone reenter my data.)
Adding fields to tables and trying to update all the application code seems cheaper than redesigning the schema. At first. Later, you just need to normalize tables that have evolved "organically", so teaching the procedures for that is reasonable even in 2024.
> if you don't express the update rules in a database schema, it must be in the application code. Subtle errors are more likely in that case.
This is a common sentiment, but I think you can always demonstrate a use case that is entirely practical, not convoluted, and cannot be handled by the classical features of relational databases, primarily CHECK (without arrays of any kind).
A couple of examples are outlined here: https://minimalmodeling.substack.com/i/31184249/more-on-stru...
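As a hypothetical illustration (not taken from the linked post) of the kind of rule CHECK handles fine versus the kind it doesn't:

    -- Easy to express declaratively:
    CREATE TABLE instructors (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        level INTEGER CHECK (level BETWEEN 1 AND 5)
    );

    -- Hard: "an instructor must have at least one language" is a rule spanning
    -- two tables, and cannot be written as a CHECK on either of them.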
Basically I'm currently pretty much sure that it's impossible to "make invalid states unrepresentable" using a classical relational model.
Also, I think that if your error feels subtle then you should elevate the formal model to make it less subtle. You can have "subtle" errors when the checks are encoded as database constraints too.
But that's what 9th NF is for! :-)
Sorry to be pedantic - if they didn't find the problem, how do we know that was the solution?
I suppose I don't know for sure. But normal forms prevent false data creation and data loss, and I know their db was not normalized from conversations with their IT support, so it's a pretty good guess.
"The existence of the theory implies that the “composed” design somehow arises, naively(?) or naturally."
Given how many first timers come up with really weird schema designs, I'm not necessarily surprised, although I agree presenting it as perhaps a default-but-wrong approach doesn't help much.
Not only beginners. I work with large teams of almost only seniors (10+ years of programming job experience) at large corps who don't know what 4NF is; they just chuck stuff in a table, update the same data across multiple tables, don't use indices or cascading rules, etc. Everything is in code; the db just stores it, badly. They treat everything like it's Mongo, and this doesn't even work in Mongo (no indices, and it dies as well of course).
4NF was explained confusingly because it's quite a complex theory. The fact that the author says 'to the best of his understanding' it is 4NF is kind of telling.
With many things in software dev, it all doesn't matter until it matters. With few records and few users, like the average department LoB crud thing, it's not really a problem; it will be when either or both grow. My 'colleagues' (I am external) claim to know all this stuff, but in reality there is no proof they do; I have never seen them do things correctly with respect to dbs or software in general. That makes me a lot of money, but it is depressing really.
I was initially going to add some stuff I find a bit... shocking in the 2020s, related to what I think is basic database stuff (maybe not specifically 4NF related). Held off, but I'll add one.
A colleague had to argue that an 'address' table should have a synthetic primary key, instead of a natural key. The argument for the initial 'design' held that 'the address itself is unique - there aren't two 123 West Main Streets!'. Wasn't my company directly, but the crux of the conversation was shared with me, and ... in 2022, with however many articles, tutorials, books, classes, stackoverflow, etc... having to argue that 'no, someone's entire address should not be the natural key for a table' was ... flummoxing to me.
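For the record, the boring-but-safe version of that table looks something like this (a sketch; the columns are illustrative):

    CREATE TABLE addresses (
        id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- synthetic key
        street      TEXT NOT NULL,
        city        TEXT NOT NULL,
        postal_code TEXT NOT NULL
        -- The full address is not the key: it can change, it can repeat
        -- ("123 West Main St" exists in many cities), and other tables shouldn't
        -- have to repeat all of it just to reference a row.
    );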
> 4NF was explained confusingly because it's quite a complex theory. The fact that the author says 'to the best of his understanding' it is 4NF is kind of telling.
> My 'colleagues' (I am external) claim to know all this stuff, but in reality there is no proof they do; I have never seen them do things correctly with respect to dbs or software in general.
Wait, I don't get it. Are you saying that "(instructor_id, skill_id); PK (instructor_id, skill_id)" (and the same for languages) is NOT in 4NF, and that I am wrong? What is in 4NF, then?
Performance characteristics of the underlying hardware dictate the software abstractions that make sense. In today's machines, we still have differential cost for sequential, aligned, rightly sized block io operations vs random io operations of random sizes. And we have a hierarchy of storages with different latency, performance and costs – cpu caches, ram – and ram, ssd, spinning disks – and these via local attached vs in disaggregated and distributed clusters.
So, if you want absolutely optimal performance, you still have to care about your column sizes, the order of columns in your records, how your records are keyed, how the tables involved in join operations are organized, and what kinds of new queries are likely in your application. This matters for both very small scale (embedded devices) and very large scale systems. For mid-scale systems with plenty of latency and cost margin, it matters a lot less than it used to. Hence the emergence of NoSQL in the last decade and distributed SQL in this decade.
1. 4NF is dependent on context. This context is described as dependencies.
In the example of employees, skills and languages the design with two separate "link tables" employee_skills and employee_language is not obvious at all without specifying requirements _first_. If there is a requirement that an employee skill depends on the language (ie. an employee can type in a particular language only) - the design with a single 3 attributes table is the right one. (See the sketch after this list.)
2. The multivalued attributes discussion is missing several important aspects, for example referential integrity. It is not at all obvious what should happen when a language is deleted from the languages table. Once you start adding rules and language to specify referential integrity for multivalued attributes, you very quickly end up with... splitting multivalued attributes into separate relations.
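In SQL terms, the two designs contrasted in point 1 (a sketch; names assumed):

    -- Skills and languages independent: two link tables (the 4NF design).
    CREATE TABLE employee_skills (
        employee_id INTEGER,
        skill_id    INTEGER,
        PRIMARY KEY (employee_id, skill_id)
    );
    CREATE TABLE employee_language (
        employee_id INTEGER,
        language_id INTEGER,
        PRIMARY KEY (employee_id, language_id)
    );

    -- Skill depends on language (e.g. "can type in that language only"): one table.
    CREATE TABLE employee_skill_language (
        employee_id INTEGER,
        skill_id    INTEGER,
        language_id INTEGER,
        PRIMARY KEY (employee_id, skill_id, language_id)
    );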
> is not obvious at all without specifying requirements _first_. If there is a requirement that an employee skill depends on the language (ie. an employee can type in a particular language only) - the design with a single 3 attributes table is the right one.
In the "Baseline" section I do exactly that: specify requirements first. Nowhere does it say anything about skill depending on the language, so I assumed that it implies that they are independent.
"Suppose that we’re building a system that lets people find sports instructors. Each instructor has a number of skills they can teach: yoga, weightlifting, swimming, etc. Also, instructors can speak one or more languages: English, French, Italian, etc."
Not sure how to improve this wording so that it is clearer that skills and languages are independent.
> Multivalued attributes discussion is missing several important aspects for example referential integrity. It is not at all obvious what should happen when a language is deleted from languages table.
I'm not sure what you mean here because multivalued attributes as defined in 1977 preserve referential integrity. But array-typed values do not, you're right.
This is a very important consideration, thank you! I did not realize that (because I don't really think that classical enforced FKs are that crucial in modern practice). But from a theory point of view, having a list of IDs is not enforced, yes.
So yes - classical enforced FKs are still important - even if practice became "modern" (whatever that means).
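A minimal sketch of what "enforced" means here (Postgres-flavored; names assumed):

    CREATE TABLE languages (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );

    -- Junction table: the FK makes "delete a language that is still referenced"
    -- an explicit decision (RESTRICT, CASCADE, SET NULL, ...).
    CREATE TABLE instructor_languages (
        instructor_id INTEGER,
        language_id   INTEGER REFERENCES languages (id) ON DELETE RESTRICT,
        PRIMARY KEY (instructor_id, language_id)
    );

    -- Array column: languages INTEGER[] can silently hold ids that no longer exist;
    -- no foreign key can be declared on the elements of an array.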