Posted by u/pcarolan 3 years ago
Ask HN: How do you test SQL?
I've been looking for resources for our data team to apply best practices for testing SQL pipelines (we use DBT) but have not found anything. How do you test SQL pipelines? What patterns, tools and best practices would you recommend? Any good reference material you know of?
RobinL · 3 years ago
Try and write any complex SQL as a series of semantically meaningful CTEs. Test each part of the CTE pipeline with an in.parquet and an expected_out.parquet (or in.csv and out.csv if you have simple datatypes, so it works better with git). And similarly test larger parts of the pipeline with 'in' and 'expected_out' files.

If you use DuckDB to run the tests, you can reference those files as if they were tables (select * from 'in.parquet'), and the tests will run extremely fast

One challenge if you're using Spark is that tests can be frustratingly slow to run. One possible solution (that I use myself) is to run most tests using DuckDB, and only run e.g. the overall end-to-end test using Spark SQL.

I've used the above strategy with PyTest, but conceptually I don't think it's particularly sensitive to the programming language/test runner you use.

Also I have no idea whether this is good practice - it's just something that seemed to work well for me.

The approach with csvs can be nice because your customers can review these files for correctness (they may be the owners of the metric), without them needing to be coders. They just need to confirm in.csv should result in expected_out.csv.

If it makes it more readable you can also inline the 'in' and 'expected_out' data e.g. as a list of dicts and pass into DuckDB as a pandas dataframe

One gotcha is that SQL does not guarantee row order, so you need to somehow sort or otherwise ensure your tests are robust to this.
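
A minimal pytest + DuckDB sketch of this pattern (the file names, the example aggregation, and the sort key are all hypothetical):

    import duckdb
    import pandas as pd

    # the CTE/step under test, kept as a plain SQL string so the same text
    # can be shared with the production pipeline
    STEP_SQL = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM 'in.parquet'
        GROUP BY customer_id
    """

    def test_total_spend_step():
        actual = duckdb.sql(STEP_SQL).df()
        expected = pd.read_parquet("expected_out.parquet")

        # SQL doesn't guarantee row order, so sort both sides before comparing
        pd.testing.assert_frame_equal(
            actual.sort_values("customer_id").reset_index(drop=True),
            expected.sort_values("customer_id").reset_index(drop=True),
        )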

paulmd · 3 years ago
If your database supports it, unit tests are an absolutely ideal use-case for temporary tables or global temporary tables. A global temporary table can be defined with the same schema as the real table and will "exist" for the purpose of view/CTE definitions, but any data inserted into the table will only ever be visible from that specific thread context. The exact rules depend on the database, but basically it either exists until the thread is closed, or until the session calls commit, and the semantics will (theoretically) be optimized for one thread and rapid inserts/clears.

If you build your unit tests so that a thread will not commit until it's finished, or clear it before/after, it's pretty ideal for that. You can feed in data for that test and it won't be visible anywhere else.

Potentially, if you scripted your table creates, you could add an additional rule that a GLOBAL TEMPORARY TABLE version gets generated alongside any other tabledefs, but only for unit tests. Or just update both in parallel.

The other useful case I've found for this is that you can do an unlimited-size "WHERE myId IN (:1, :2 ...)" by loading the IDs into a GTT and joining: "INNER JOIN myGtt ON myId = myGtt.myId". This has the advantage of not thrashing the query planner with a new literal sqltext for every different length of myList (that will fill up your query plan cache!). Since it's one literal string it always hits the same cached query plan.
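
A rough sketch of the join-on-a-GTT idea, Oracle-flavoured SQL through a generic DB-API connection (every name here is made up):

    # one-time DDL: a global temporary table whose rows are only visible to the
    # inserting session and are cleared on commit (Oracle-style syntax)
    GTT_DDL = """
    CREATE GLOBAL TEMPORARY TABLE my_id_filter (
        my_id NUMBER
    ) ON COMMIT DELETE ROWS
    """

    def fetch_by_ids(conn, ids):
        """Replace an unbounded 'WHERE myId IN (:1, :2, ...)' with a join, so the
        query text (and therefore its cached plan) is identical no matter how
        many ids are passed in."""
        cur = conn.cursor()
        cur.executemany("INSERT INTO my_id_filter (my_id) VALUES (:1)",
                        [(i,) for i in ids])
        cur.execute("""
            SELECT t.*
            FROM my_table t
            INNER JOIN my_id_filter f ON t.my_id = f.my_id
        """)
        rows = cur.fetchall()
        conn.commit()  # clears the GTT rows for this session
        return rows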

I am told this has problems in Oracle's OCI when using PDBs; apparently GTTs will shit up the redo-log tablespace (WAL equivalent). But so do many, many things. PDBs are apparently a much different, much less capable and incredibly immature implementation that seems essentially abandoned/not a focus, from what I've been told. I was very surprised, having originally written that code in regular Oracle and not having had problems, but PDBs just aren't the same (right down to nulls being present in indexes!).

n0n0n4t0r · 3 years ago
Off topic incoming (sorry )

I used this trick (join a temporary table instead of WHERE foo IN ...) in production fifteen years ago, using MySQL. The gain was really astonishing. Several operations can be optimized using joins on specially crafted tables (I know of LIMIT, for instance).

This is one of the worst drawbacks of ORMs everywhere: nobody even seems to think about these optimisations anymore.

MrPowers · 3 years ago
Spark makes it easy to wrap SQL in functions that are easy to test. I am the author of the popular Scala Spark (spark-fast-tests) and PySpark (chispa) testing libraries. Some additional tips that can speed up Spark tests by 70-90%:

* reuse the same Spark session throughout the test suite

* Set shuffle partitions to 2 (instead of default which is 200)

* Use dependency injection to avoid disk I/O in the test suite

* Use fast DataFrame equality when possible. assertSmallDataFrameEquality is 4x faster than assertLargeDataFrameEquality. Some benchmarks here: https://github.com/MrPowers/spark-fast-tests#why-is-this-lib...

* Use column equality to test column functions. Don't compare DataFrames unless you're testing custom DataFrame transformations. See the spark-style-guide for definitions for these terms: https://github.com/MrPowers/spark-style-guide/blob/main/PYSP...

Spark is an underrated tool for testing SQL. Spark makes it really easy to abstract SQL into unit testable chunks. Configuring your tests properly takes some knowledge, but you can make the tests run relatively quickly.
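
A sketch of what those settings might look like as a shared pytest fixture (the fixture shape and the toy transform are my own; chispa's assert_column_equality does the column comparison):

    import pytest
    from pyspark.sql import SparkSession, functions as F
    from chispa import assert_column_equality

    @pytest.fixture(scope="session")
    def spark():
        # one Spark session reused across the whole suite, with shuffle
        # partitions dropped from the default 200 down to 2
        return (SparkSession.builder
                .master("local[2]")
                .appName("sql-tests")
                .config("spark.sql.shuffle.partitions", "2")
                .getOrCreate())

    def test_upper_name(spark):
        df = (spark.createDataFrame([("alice", "ALICE"), ("bob", "BOB")],
                                    ["name", "expected"])
              .withColumn("actual", F.upper(F.col("name"))))
        assert_column_equality(df, "actual", "expected")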

RobinL · 3 years ago
That's super useful, thanks. Could you expand on the 'Use dependency injection to avoid disk I/O in the test suite' point please - I'm not sure I understand what it means but it sounds interesting!
colanderman · 3 years ago
One caution with PostgreSQL: under some circumstances (and in all circumstances prior to PostgreSQL 12), CTEs act as optimization barriers. Specify `NOT MATERIALIZED` after `AS` in the CTE definition to ensure they are optimized the same as a sub-SELECT would be.
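
For reference, the syntax looks roughly like this (psycopg2 and the orders table are just for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN

    with conn, conn.cursor() as cur:
        # on PostgreSQL 12+, NOT MATERIALIZED lets the planner inline the CTE
        # and push predicates into it, just like a sub-SELECT
        cur.execute("""
            WITH recent_orders AS NOT MATERIALIZED (
                SELECT * FROM orders
                WHERE created_at > now() - interval '7 days'
            )
            SELECT count(*) FROM recent_orders WHERE status = 'shipped'
        """)
        print(cur.fetchone()[0])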
timwis · 3 years ago
Just to flag: this behaviour changes in v12: https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...
nicoburns · 3 years ago
This is a good point, although it’s worth noting that there are definitely cases where you do want the sub-table to be materialised (esp. when that table is small and referenced many times)
gpvos · 3 years ago
CTE = common table expression, i.e. a WITH clause (possibly recursive) before your SELECT (or other) statement. https://learnsql.com/blog/what-is-common-table-expression/
stormdennis · 3 years ago
Yes, I didn't know what it was so I had looked it up before I saw yours. The definition I got is below:

A common table expression, or CTE, is a temporary named result set created from a simple SQL statement that can be used in subsequent SELECT, DELETE, INSERT, or UPDATE statements.

curiousllama · 3 years ago
> Try and write any complex SQL as a series of semantically meaningful CTEs

I find this helps a heck of a lot with maintainability + debugging as well

atwebb · 3 years ago
It can wind up encouraging more procedural thinking than set-based thinking, though. Not always; just something to pay attention to, along with pushing filters early/explicitly. It often asks more of the optimizer, and that's where a bunch of the cautions come into play.
shazzdeeds · 3 years ago
A few quick tips to help with slow Spark tests (a rough sketch for point 2 follows the list).

1. Make sure you're using a shared session across the test suite, so that spin-up only has to occur once per suite and not per test. This has the drawback of not allowing dataframe name reuse across tests, but who cares.

2. If you have any kind of magic number N in the SparkSQL or dataframe calls (e.g. coalesce(N), repartition(N)), make N a parameter and set it to 1 for the test.

3. Make sure the master doesn't have any more than 'local[2]' set. Or less depending on your workstation.
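
For point 2, a rough sketch of what parameterising the magic number might look like (the function and names are invented):

    from pyspark.sql import DataFrame

    def write_daily_extract(df: DataFrame, out_path: str,
                            num_partitions: int = 200) -> None:
        # the partition count is injected so production keeps its tuned value,
        # while tests pass num_partitions=1 and avoid a wide shuffle
        df.repartition(num_partitions).write.mode("overwrite").parquet(out_path)

    # in a test:
    # write_daily_extract(small_df, str(tmp_path), num_partitions=1)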

marijnz0r · 3 years ago
I really like this answer because it tests CTEs in a modular way. One question I have is: how do you test the CTE with an in.csv and out.csv without altering the original CTE? Currently I have a chain of multiple CTEs in a sequence, so how would I be able to take the middle CTE and "mock" the previous one without altering the CTE actually used in production? I'd prefer not to maintain a CTE for the live query and the same CTE adapted for tests.
jimmygrapes · 3 years ago
My ignorance of the topic (and experience with a mostly unrelated one) is showing, but all I could think of when you said CTE was "chronic traumatic encephalopathy". This made a lot more sense when you generalized answering the question as if it's a given that Python is necessary (I know that's not your intent, but that's how it comes off).

Not much more to say, just observing, sorry if this is irrelevant commentary.

jph00 · 3 years ago
CTE==Common Table Expression. It's not specific to any particular language. It's basically like a view, except you can define them the same place as you use them, and they can be recursive.
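
A tiny illustration, run through DuckDB purely for convenience:

    import duckdb

    # a recursive CTE that generates the numbers 1..5
    sql = """
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter ORDER BY n
    """
    print(duckdb.sql(sql).fetchall())   # [(1,), (2,), (3,), (4,), (5,)]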
db48x · 3 years ago
Sometimes using databases feels just like repeatedly slamming your head into your desk, so that fits.
osigurdson · 3 years ago
>> I could think of when you said CTE was "chronic traumatic encephalopathy"

The title is "Ask HN: How do you test SQL"

intrasight · 3 years ago
I'm a fan of CTEs too. Here's a pattern that I use in SQL Server for testing/debugging when using CTE pipelines. The same approach would work with FOR JSON.

   declare @debug bit = 1;

   ;with cte1 as (
     select
       @debug AS Debug1,
       ...
   ),
   cte2 as (
     select 
       @debug AS Debug2,
       ... from cte1
   ),
   cte3 as (
     select 
        @debug AS Debug3,
        ... from cte2
   )
   select
     -- dump intermediate if debug
     (
       select * from cte1 where Debug1=1 for xml raw ('row'), root ('cte1'), type
     )
     ,(
       select * from cte2 where Debug2=1 for xml raw ('row'), root ('cte2'), type
     )
     ,(
       select * from cte3 where Debug3=1 for xml raw ('row'), root ('cte3'), type
     )
     -- final results
     ,(
         select ...
         from cte3 for xml raw ('row'), type
   )
   for xml raw ('results'), type;

devin · 3 years ago
> Try and write any complex SQL as a series of semantically meaningful CTEs.

Could you or anyone else on the post provide an example?

paulmd · 3 years ago
let's say you are doing a paystub rollup - a department has multiple employees, an employee has multiple paystubs.

if you are storing fully denormalized concrete data about the value of salary/medical/retirement, both pre- and post-tax, that was actually paid in each pay period (because this can vary!), then you can define a view that does salary-per-employee (taxable, untaxable, etc), and then a view that rolls up employees-per-department. And you can write unit tests for all of those.
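
A hedged sketch of those two layers plus quick checks, run with DuckDB for convenience (all table/column names and figures are invented):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE paystub(employee_id INT, department_id INT, "
                "gross INT, pretax_deductions INT)")
    con.execute("INSERT INTO paystub VALUES "
                "(1, 10, 5000, 500), (1, 10, 5000, 500), (2, 10, 4000, 0)")

    # layer 1: per-employee rollup
    con.execute("""CREATE VIEW pay_per_employee AS
        SELECT employee_id, department_id,
               SUM(gross) AS gross,
               SUM(gross - pretax_deductions) AS taxable
        FROM paystub GROUP BY employee_id, department_id""")

    # layer 2: per-department rollup built on top of layer 1
    con.execute("""CREATE VIEW pay_per_department AS
        SELECT department_id, SUM(gross) AS gross, SUM(taxable) AS taxable
        FROM pay_per_employee GROUP BY department_id""")

    # each layer gets its own small assertion against hand-computed numbers
    assert con.execute("SELECT taxable FROM pay_per_employee "
                       "WHERE employee_id = 1").fetchone()[0] == 9000
    assert con.execute("SELECT gross FROM pay_per_department "
                       "WHERE department_id = 10").fetchone()[0] == 14000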

that's a super contrived example but basically once group aggregate or window functions and other complex sub-queries start coming into the picture it becomes highly desirable to write those as their own views. And you can write some simple unit tests for those views. there are tons of shitty weird sql quirks that come from nullity/etc and you can have very weird specific sum(mycol where condition) and other non-trivial sub-subquery logic, and it's simple to just write an expression that you think is true and validate that it works like you think, that all output groups (including empty/null groups etc) that you expect to be present or not present actually are/aren't, etc.

I'm not personally advocating for writing those as CTEs specifically as a design goal in preference to views; I'd rather write views where possible. But recursive CTEs are the canonical approach for certain kinds of queries (particularly node/tree structures), and at minimum a CTE certainly is a "less powerful context" than a generalized WHERE EXISTS (select 1 from ... WHERE myVal = outerVal) or value-select subquery. It's desirable to have that isolation from the outer SQL query cursor imo (and depending on what you're asking, it may optimize to something different in terms of plan).

Writing everything as a single query, where the sub-sub-query needs to be 100% sure not to depend on the outer-outer-cursor, is painful. What even is "DENSE_RANK()" in the context of this particular row/window? If you've got some bizarre RANK(myId ORDER BY timestamp) or whatever, does it really work right? Etc. It's just a lot easier to conceptually write each "function" as a level with its own unit tests. Same as any other unit-testable function block: it's ideal if it's Obviously Correct, and then you define compositions of Obvious Correctness with their own provable correctness.

And if it's not Obviously Correct then you need the proof even more. Encapsulate whatever dumb shit you have to do to make it run correctly and quick into a subquery and just do a "inner join where outerQuery.myId = myView.myId". Hide the badness.


ChadMoran · 3 years ago
> Try and write any complex SQL as a series of semantically meaningful CTEs

I used this exact same method. Not only does it help me but those who come after trying to understand what's going on.

tensor · 3 years ago
Decomposing a large query into smaller subsets is the right approach, but I would strongly suggest doing it with VIEWs rather than CTEs. CTEs are a useful tool, but come with their own performance profiles and often using them too much will lead to slower queries.
mritchie712 · 3 years ago
Doesn't that mean you have to copy all your data into duckdb? I'd imagine with even a modest data warehouse, loading the entire thing to duckdb would be unfeasible.
RobinL · 3 years ago
The tests are run against small test datasets of tens of rows rather than the real data; they test whether the behaviour of transforms/joins etc. is as expected. You're right, this approach wouldn't be sensible if the tests had to use large production tables.
ramenmeal · 3 years ago
We spin up a docker container running the DB technology we use, run our DB migration scripts on it, and then run integration tests against it. You get coverage of your migration scripts this way too.
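
For Python projects, one way this can look (testcontainers + SQLAlchemy; run_migrations and the users table stand in for your real migration scripts and schema):

    import pytest
    import sqlalchemy
    from testcontainers.postgres import PostgresContainer

    @pytest.fixture(scope="session")
    def db_engine():
        # throwaway Postgres in Docker for the whole test session
        with PostgresContainer("postgres:15") as pg:
            engine = sqlalchemy.create_engine(pg.get_connection_url())
            run_migrations(engine)  # placeholder: apply your real migration scripts
            yield engine

    def test_user_insert(db_engine):
        with db_engine.begin() as conn:
            conn.execute(sqlalchemy.text("INSERT INTO users (name) VALUES ('alice')"))
            count = conn.execute(sqlalchemy.text("SELECT count(*) FROM users")).scalar()
        assert count == 1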
MBCook · 3 years ago
This is what we did at my last job. You can catch DB-specific issues that a fake implementation wouldn’t show, and make sure all your code paths work as expected.

Every time a new issue cropped up, we would add new data to the test dataset designed to reproduce it. Every edge case we ran into.

It provided so much confidence because it would catch and trigger so many edge cases that testing with mocks or by hand would miss.

Edit: also, it’s great for test/dev environments. You don’t have to worry about losing important data or filling new environments with data. Just start with the full test data and you’re good to go. It’s got stuff for all the corner cases already. Something got screwed up? Blow it away and reload, it’s not precious.

jdc0589 · 3 years ago
same here. every backend service with a sql datastore runs a real ephemeral db instance during CI, runs the normal migrations against it, and uses it for tests of the direct data access/persistence code. everything else in the service with a dependency on the data-access stuff gets a mocked version, but testing the actual data access code against a real datasource is non-negotiable IMO.
acheton · 3 years ago
We did something similar, being a small company we used batch files (checked into source control) to run database migrations on the different platforms we supported. Most database platforms have command line tools, although you need to be careful as there can be subtle differences in behaviour between the command line tools & those running with a UI.
fatneckbeard · 3 years ago
thats a huge amount of work and money.

at my company they just told us to stop reporting edge cases. much easier, much cheaper.

feike · 3 years ago
We do this too for PostgreSQL. To ensure the tests are really fast:

    - we create a template database using the migrations
    - for *every* integration test we do `CREATE DATABASE test123 TEMPLATE test_template;`
    - we tune the PostgreSQL instance inside Docker to speed things up, for example disabling synchronous_commit
On a successful test, we drop the test123 database. On a failed test, we keep the database around, so we can inspect it a bit.

The really great thing about this approach (IMHO), is that you can validate certain constraint violations.

For example, exclusion constraints are great for modelling certain use cases where overlapping ranges should be avoided. In our (go) code, the test cases can use the sqlstate code, or the constraint name to figure out if we hit the error we expect to hit.

This approach is pretty much as fast as our unit tests (your mileage may vary), but it prevents way more bugs from being merged into our codebase.
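
A rough sketch of the per-test database creation with psycopg2 (the DSN and naming scheme are assumptions):

    import uuid
    import psycopg2

    ADMIN_DSN = "host=localhost user=postgres dbname=postgres"  # assumption

    def create_test_database():
        """Clone the pre-migrated template into a throwaway database."""
        name = f"test_{uuid.uuid4().hex[:8]}"
        conn = psycopg2.connect(ADMIN_DSN)
        conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
        with conn.cursor() as cur:
            cur.execute(f"CREATE DATABASE {name} TEMPLATE test_template")
        conn.close()
        return name

    def drop_test_database(name):
        conn = psycopg2.connect(ADMIN_DSN)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(f"DROP DATABASE {name}")
        conn.close()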

neonate · 3 years ago
Out of curiosity, how fast is really fast?
pgoggijr · 3 years ago
I've used this same approach as well. Testcontainers is a nice way to help with this!

https://www.testcontainers.org/

JackFr · 3 years ago
I rolled my own with docker for a few years and recently made the switch to testcontainers. So far so good, but if you’re in an environment or language where test containers are difficult, rolling your own really isn't too hard. It also keeps you honest with maintaining good migration scripts.
jve · 3 years ago
What is testcontainer? Reading the first page I don't understand what benefits it buys me:

> Testcontainers for .NET is a library to support tests with throwaway instances of Docker containers for all compatible .NET Standard versions. The library is built on top of the .NET Docker remote API and provides a lightweight implementation to support your test environment in all circumstances.

Edit: Ok, example helps. https://dotnet.testcontainers.org/examples/aspnet/

I can declare docker infrastructure from my code, right?

claytonjy · 3 years ago
Yup, same. Last time i set this up i used Sqitch¹ for migrations, which encourages you to write tests for each migration; caught a lot of bugs early that way, all in a local-first dev environment. Worked especially well for Postgres since plpgsql makes it easy to write tests more imperatively.

¹: https://sqitch.org/

grilledcheez · 3 years ago
At my job, we're breaking down a monolith into services with a hand-me-down database schema. DB changes are manual, every dev runs against a shared test DB, and everybody dreads doing schema changes. I've been looking for a way to transition into version controlled migrations and it looks like sqitch might be a solid option, as the language-specific frameworks are too opinionated. Thanks for recommending!
Dowwie · 3 years ago
I resented writing the verify scripts for my migrations, after writing unit and integration tests, but yes it is valuable
kakwa_ · 3 years ago
We do that too; in fact it's not only the DB that runs in docker, but the whole build + CI process.

Our overall strategy is to create a master "test" DB with a test dataset, and for each test, copy this master DB to a test specific DB (CREATE DATABASE <testdb> TEMPLATE <master>) so that tests can run in parallel without interfering with each other and without the significant overhead of a "from scratch" DB initialization.

For schema migrations, we build the "branch/PR" version and the "main" version, then we check that 'init DB with "main" + migration with the "branch/PR" version' results in the same schema as 'init the DB directly with the "branch/PR" version' using apgdiff.
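
A sketch of that schema-equivalence check, with the shell steps wrapped in Python (database names and the apgdiff jar path are assumptions):

    import subprocess

    def schema_dump(dbname: str, outfile: str) -> None:
        # schema-only dump, no owners/privileges, so diffs stay readable
        with open(outfile, "w") as f:
            subprocess.run(
                ["pg_dump", "--schema-only", "--no-owner", "--no-privileges", dbname],
                stdout=f, check=True,
            )

    # db_main_migrated: initialised from "main", then branch migrations applied
    # db_branch_fresh:  initialised directly from the branch
    schema_dump("db_main_migrated", "migrated.sql")
    schema_dump("db_branch_fresh", "fresh.sql")

    diff = subprocess.run(
        ["java", "-jar", "apgdiff.jar", "migrated.sql", "fresh.sql"],
        capture_output=True, text=True, check=True,
    )
    assert diff.stdout.strip() == "", f"schema drift:\n{diff.stdout}"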

This strategy could probably be extended to migrating from every older version by building each tag, but we don't have that need.

We could also probably improve checks on the data itself; for now, we only check the schemas.

Few things to note:

* it's still possible to run the tests outside of docker and use a local DB instead with some light setup (it's faster than running everything in docker when developing)

* docker argument --tmpfs <dir/of/db> is quite good, assuming you have enough ram for your dataset

* a few configuration tweaks on the DB, like max_connections, might be necessary.

Overall, we are quite happy with this setup as it lets us implement end-to-end integration tests quite easily without spending too much time mocking dependencies.

As a general pattern, I find instantiating dependencies internal to your service (like a DB or queue) to be the way to go, with mocking only for external dependencies (like external APIs) or, exceptionally, to reach a specific code branch (especially error-handling sections).

ddyevf635372 · 3 years ago
Running SQLite in memory as a test db speeds up your test runner like crazy. You can do this if you use an SQL query builder library, because it can translate your queries to the specific database.
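
A minimal sketch of that setup with SQLAlchemy Core (table and data invented; the caveats in the replies about engine differences still apply):

    import sqlalchemy as sa

    # in production this URL would point at Postgres/MySQL; for tests we swap in
    # an in-memory SQLite engine and let the query builder handle the dialect
    engine = sa.create_engine("sqlite+pysqlite:///:memory:")

    metadata = sa.MetaData()
    users = sa.Table(
        "users", metadata,
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("name", sa.String(50)),
    )
    metadata.create_all(engine)

    with engine.begin() as conn:
        conn.execute(users.insert(), [{"name": "alice"}, {"name": "bob"}])
        count = conn.execute(sa.select(sa.func.count()).select_from(users)).scalar()
    assert count == 2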
whalesalad · 3 years ago
This can be a good fast/local test or maybe a sanity test ... but there are definitely differences between databases that need to be accounted for. You wanna take that green test pass with a bit of skepticism. So you always want to test on the same DB engine that is running your prod workloads. If your surface area is small, you can get by with the approach you mentioned, but it would need to be marked tech debt that should be paid down as soon as possible, or as soon as that bug that manifests itself in PSQL but not sqlite appears :)
danielheath · 3 years ago
You can get the 'in-memory' speed advantage by putting the datastore on a ramdisk (or even just disabling fsync, which is pretty easy to do in postgresql).
rc_mob · 3 years ago
The problem is that it's not always compatible with features you use on your production database.
gardenhedge · 3 years ago
So you test against a different database technology than the one your software uses? I understand why that works but it seems odd.
sicp-enjoyer · 3 years ago
Why should this be faster than a local postgres instance with no traffic?
crabbone · 3 years ago
This may work for something simple, but a typical database cluster setup will be impossible/impractical to emulate in containers, because you'd need to configure a lot of things not normally available inside containers (such as how storage is attached and how memory is allocated).

Since OP mentioned DBT (kind of weird, hopefully, it's at least DBT2, since DBT is very old), they mean to test the productivity of the system rather than correctness of some queries (typical tests that deal with workloads similar to DBT2 are, eg. pgbench). Running pgbench over a database setup in a container will tell you nothing / might just confuse you, if you don't understand the difference between how database is meant to be setup and what happens in container.

atmosx · 3 years ago
Yeah. That’s it. I have seen this method done with sample data (low fidelity) and/or performed against production copies of an RDS DB (high fidelity). It’s still hard to catch some migrations or other operations under workload, but you can emulate that as well to a certain point.
hot_gril · 3 years ago
Same (though I don't use Docker). Did TDD for a while like this, and it wasn't perfect, but it worked better than anything else. I didn't even know how to run the frontend; that was a separate team.
inetknght · 3 years ago
bonus points if you add test data for integration tests with sad paths too
csteubs · 3 years ago
Absolutely this. I stood up a negative test suite for continuous DB queries late in 2020 and it's caught many potential show stopper integration issues since; about 45% more YoY than pre-suite.

Took about a week of duplicating happy path tests, investigating gaps, and switching inputs/assertion types to get everything passing, but less than a week later we had our first verifiable test failure.

allanbreyes · 3 years ago
This is a great way to test for backwards-incompatible changes if your fleet is running canaries or instances on different versions backed by a singleton database. You apply the migrations, checkout the app from the old version, and then re-run your test suite. Any failures are a reasonably high signal that some backwards-incompatible migration was introduced.
MuffinFlavored · 3 years ago
Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?
nishs · 3 years ago
Sorry for the digression first. (If anyone has different definitions for the ideas here, I would love to learn.)

I think the answers would depend on the types of tests that the term "this" encompasses. From how I understand it, calling something a unit test or an integration test depends on the context of what is being tested. For example, if a developer is writing a unit test for an HTTP handler, and the handler implementation includes calls to an external database, then one would have to use a mock for the database, in order for the test to be deemed a true unit test [1]. If the developer is writing an integration test for the same HTTP handler, then the database would have to be a real database implementation [2].

On the other hand, if the developer were testing SQL queries for syntactical or logical correctness against a database, these tests would want to use a real database implementation. Note that though the test uses a real database, it is still a true unit test [3]. Additionally, note that using a mocked database here would not serve the purpose of the tests, which is to catch syntactical and logical errors against a database engine. This can, of course, only be achieved by using a real database—or, if you insisted on using a "mock", then by implementing an entire SQL engine, with the exact same quirks as the real database's engine, inside the mock!

On the original question:

> Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?

I guess the answer would be: it depends on the objectives and types of the tests. Do both of them, because some tests, such as unit tests on the HTTP handler, would use mocks, while other tests, such as the SQL query correctness tests, would use the real database.

[1] A true unit test is one that has no external interactions, neither directly nor transitively, besides interactions with the system under test (SUT). The SUT here is the HTTP handler.

[2] An integration test should include external system interactions, if any. That's what integration means.

[3] The SUT is each SQL query and its interaction with the database. There are no interactions in the test with systems outside the SUT, so it is still a true unit test.

sssspppp · 3 years ago
would love to do this, but how does one spin up a redshift cluster inside of a docker container?
ntr-- · 3 years ago
try not to cry, cry a lot, and then resolve not to vendor lock yourself to a black box data store next time.

jokes aside, redshift is based on pg^1, you can try an older version to get some semblance of it running locally.

1. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-an...

nicoburns · 3 years ago
And this is why you don’t rely on a closed source stack if you have any alternative.
danmcs · 3 years ago
Redshift speaks the postgres protocol so you might be able to use postgres. There are a few purpose-built docker images (googleable) that may replicate Redshift slightly better than just `docker run -p 127.0.0.1:5439:5432 postgres:latest`, but if you're at the point of having a test suite for your data warehouse code, you're likely using Redshift-specific features in your code and postgres won't suffice.

I have seen teams give each developer a personal schema on a dev cluster, to ensure their Redshift SQL actually works. The downside is that now your tests are non-local, so it's a real tradeoff. In CI you probably connect to a real test cluster.

contravariant · 3 years ago
Well the first step is to get Amazon to part with their lucrative closed source software.
atwebb · 3 years ago
Would redshift serverless help at all? I realize that it may not have parity with the existing functionality, just a thought.
totalhack · 3 years ago
Same. That's one of the benefits of being able to replicate your environment with something like docker compose and then use a tool like alembic to manage migrations both locally and in production.


unboxingelf · 3 years ago
This is the way
purerandomness · 3 years ago
There's pgTAP for Postgres [1], the same approach probably is valid for other databases.

Here's [2] a slide deck by David Wheeler giving an introduction into how it works.

[1] https://pgtap.org/

[2] https://www.slideshare.net/justatheory/unit-test-your-databa...

vogtb · 3 years ago
+1 for pgtap. Surprised it's not been mentioned more in this thread.

I'm using it w/ supabase, and it works really well.

They have a small doc on it that's a better primer than the pgtap docs: <https://supabase.com/docs/guides/database/extensions/pgtap>.

Pretty easy to get started, I'm doing something like this in a Makefile.

  supabase_tests := $(call rfind, supabase/tests/*_tests.sql)
  DB_URL := "postgresql://postgres:postgres@localhost:54322/postgres"
  test-supabase: $(supabase_tests)
    @echo "Testing supabase..." && \
        echo "${supabase_tests}" && \
        psql -f supabase/tests/setup.sql ${DB_URL} && \
        $(foreach t,$(supabase_tests),\
            echo "Test: $(t)..." && psql -f $(t) ${DB_URL} $(__EXEC)) && \
        psql -f supabase/tests/teardown.sql ${DB_URL}

gregplaysguitar · 3 years ago
We used to use pgtap extensively, but ended up removing it in favour of testing queries as part of our regular nodejs application tests, via a db client - the ergonomics of the test tooling are far better, and I don't think we really lost anything.
nesarkvechnep · 3 years ago
Unfortunately, I’m not surprised people test queries in their applications’ unit tests. What they’re actually testing is the ORM/query builder. Instead, with pgTAP, you can test specifically your queries.
adulion · 3 years ago
I was going to mention pgtap as I had used it in a previous role; it works, but it's cumbersome. I was hoping for a better solution by reading the comments.
efxhoy · 3 years ago
I write mostly batch ETL stuff. All plain psql and bash. We don’t have a good testing setup to be honest. What we do use instead:

Plenty of constraints, uniques and foreign keys and not nulls. Enum types.

Visuals, dump to csv and plot some graphs. Much easier to find gaps and strange distributions visually.

Asserts in DO blocks, mostly counts being equal.

Build tables in a _next suffix schema and swap when done.

Never mutating the source data.

Using psql's ON_ERROR_STOP setting.

Avoid all but the most trivial CTEs, preferring intermediate tables that can be inspected. Constraints and assertions on the intermediate tables.

“Wasting” machine resources and always rebuilding from scratch when feasible. CREATE TABLE foo AS SELECT is much simpler than figuring out which row to UPDATE. Also ensures reproducibility, if you’re always reproducing from scratch it’s always easy. State is hard.

Overall I’m quite happy with the workflow, and very rarely do we make mistakes that unit tests would have caught. Our source data is complex and not always well understood (10+ years of changing business logic), so writing good tests would be very hard. Because we never touch the raw source data, any errors we inevitably make are recoverable.

This talk by Dr Martin Loetzsch helped a lot: https://youtu.be/whwNi21jAm4

toto444 · 3 years ago
I admit it's not the best, but I do it almost like you. Graphs are definitely helpful.

I also have a few 'test' queries that insert their results into a 'test_results' table. Most of the queries check the cardinality of the table, since to me wrong cardinality is where the biggest errors come from. I do something like:

    insert into test_table
    select
      case when count(*) = count(distinct users) then 'pass' else 'fail' end as result,
      'test_cardinality_temporary_table_a' as test_name
    from temporary_table_a

vivegi · 3 years ago
There's some great advice here for batch ETL work. The Create Table .. As Select (CTAS) pattern is your friend and a great aid in testability.
drx · 3 years ago
If you're using dbt, dbt tests are a good start: https://docs.getdbt.com/docs/build/tests

You can hook up dbt tests to your CI and Git(hub|lab) for data PRs.

Depending on your needs, you can also look into data observability tools such as Datafold (paid) or re_data (free)

mritchie712 · 3 years ago
I'm surprised dbt tests is this far down in the comments. This seems like the obvious place to start.

OP (or others) - If you've used dbt tests, I'm curious where it fell short? It doesn't cover everything, but it's pretty good in my experience.

adeelk93 · 3 years ago
And if you want/need to take it further, I love this package: https://github.com/calogica/dbt-expectations
chrisoldwood · 3 years ago
Back in the mid-noughties I decided to see if I could write SQL in a test-first manner (i.e. TDD). This resulted in me writing a 100% T-SQL based unit testing framework for SQL Server [1] which we then used for the SQL back-end at an investment bank.

On the back of that professional use I wrote a blog post [2] explaining why you might choose to go down this route, as it wasn't the way database code was developed back then (SQL wasn't developed in the same way as the other front-end and back-end code).

A few years later I gave a short 20-minute talk (videoed) [3] to show what writing SQL using TDD looked like for me. It's hard to show all the kinds of tests we wrote in practice at the bank, but the talk is intended to show how rapid the feedback loop can be using a standard DB query tool and two code windows - production code and tests.

Be kind, it was a long time ago and I'm sure the state of the art has improved a lot in the intervening years :o).

Chris Oldwood

---

[1] SQL Server Unit: https://github.com/chrisoldwood/SS-Unit

[2] You Write Your SQL Unit Tests in SQL?: https://chrisoldwood.blogspot.com/2011/04/you-write-your-sql...

[3] Test-Driven SQL: https://www.youtube.com/watch?v=5-MWYKLM3r0

AnEro · 3 years ago
GitLab has their guide up; I love it and use it all the time. I've been doing data engineering in a small team for about 4 years, helping hospitals with data and hopefully making it easier to understand. Something that is overlooked or undervalued in my opinion: have stark distinctions separating technical and business logic tests. It makes it easier to communicate what's happening in the event something is 'wrong' vs wrong, and it's easier to bring a non-technically inclined team member up to speed. Also, I think it's good to learn from the SaaS side of things and not bloat up or overengineer with infrastructure just because data engineering is the latest development flavour. Keep it simple. Great Expectations is a great tool; however, I think small teams should take a really hard look at their needs and see if a simple orchestration engine and SQL testing is enough. A centralized source for testing is great, but infrastructure isn't free, and even when it is, you are paying for it with you and your team's time.
hampelm · 3 years ago
Could you link to the specific guide you're referring to? I see a couple on quick search -- perhaps this one? https://docs.gitlab.com/ee/development/database_review.html
AnEro · 3 years ago
Enjoy, it's honestly the best resource I've seen on data teams that is open. https://about.gitlab.com/handbook/business-technology/data-t...
dagss · 3 years ago
Probably not that relevant for a data team, but this is what we do as a backend team:

We use Microsoft SQL Server's docker image and spin it up in the background on our laptop/CI server so port 1433 has a database.

Then we have our homegrown migration file runner that will compute a hash of the migrations, make a database template_a5757f7e, and run the hundreds of migrations on it, whenever we add a new SQL migration (todo: make one template build on the previous).

Then we use the BACKUP command to dump the db to disk (within the docker image)

Finally, each test function is able to make a new database and restore that backup from file in less than a second. Populate with some relevant test data, run code, inspect results, drop database.

So our test suite uses hundreds of fresh databases and it still runs in a reasonable time.

(And..our test suite is written in Go, with a lot of embedded SQL strings, even if a lot of our business logic is in SQL)