I've been looking for resources for our data team to apply best practices for testing SQL pipelines (we use DBT) but have not found anything. How do you test SQL pipelines? What patterns, tools and best practices would you recommend? Any good reference material you know of?
If you use DuckDB to run the tests, you can reference those files as if they were tables (select * from 'in.parquet'), and the tests will run extremely fast
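A minimal sketch of that (assuming an in.parquet fixture file sits next to the test):

    import duckdb

    # DuckDB treats local Parquet/CSV files as tables, no load step needed
    result = duckdb.sql("select * from 'in.parquet'").df()
    print(result)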
One challenge if you're using Spark is that tests can be frustratingly slow to run. One possible solution (that I use myself) is to run most tests using DuckDB, and only e.g. the overall end-to-end test using Spark SQL.
I've used the above strategy with PyTest, but conceptually I don't think it's particularly sensitive to the programming language/test runner you use.
Also I have no idea whether this is good practice - it's just something that seemed to work well for me.
The approach with CSVs can be nice because your customers can review these files for correctness (they may be the owners of the metric) without needing to be coders. They just need to confirm that in.csv should result in expected_out.csv.
If it makes things more readable, you can also inline the 'in' and 'expected_out' data, e.g. as a list of dicts, and pass it into DuckDB as a pandas DataFrame.
One gotcha is that SQL does not guarantee row order, so you need to sort the results or otherwise ensure your tests are robust to this.
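Putting those pieces together, a PyTest-style sketch (the query, table, and column names here are invented for illustration):

    import duckdb
    import pandas as pd

    def test_dedupe_query():
        in_df = pd.DataFrame([
            {"user_id": 1, "amount": 10},
            {"user_id": 1, "amount": 10},  # duplicate the query should drop
            {"user_id": 2, "amount": 5},
        ])
        expected_out = pd.DataFrame([
            {"user_id": 1, "amount": 10},
            {"user_id": 2, "amount": 5},
        ])

        con = duckdb.connect()
        con.register("in_df", in_df)  # expose the DataFrame to SQL as a table
        actual = con.sql("select distinct user_id, amount from in_df").df()

        # SQL doesn't guarantee row order, so sort both sides before comparing
        def sorted_df(df):
            return df.sort_values(["user_id", "amount"]).reset_index(drop=True)

        pd.testing.assert_frame_equal(sorted_df(actual), sorted_df(expected_out))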
If you build your unit tests so that a thread won't commit until it's finished, or clears the table before/after, GTTs are pretty ideal for that. You can feed in data for that test and it won't be visible anywhere else.
Potentially, if you scripted your table creates, you could add an additional rule that GLOBAL TEMPORARY TABLE gets added to the table defs, but only for unit tests. Or just update both in parallel.
The other useful use case I've found for this is that you can do an unlimited-size "WHERE myId IN (:1, :2 ...)" by loading the values into a GTT and expressing the filter as "INNER JOIN ... ON myId = myGtt.myId". This has the advantage of not thrashing the query planner with a different literal SQL text for every length of myList (which will fill up your query plan cache!). Since it's one literal string, it always hits the same query plan cache entry.
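Translated to a minimal runnable sketch (SQLite temp tables standing in for Oracle GTTs; table and column names invented):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("create table orders (my_id integer, amount integer)")
    con.executemany("insert into orders values (?, ?)", [(1, 10), (2, 20), (3, 30)])

    ids = [1, 3]  # arbitrarily long list of ids to filter on

    # Instead of building "where my_id in (?, ?, ...)" with a varying number
    # of placeholders (a new SQL text, hence a new query plan, per list length),
    # load the ids into a temp table and join against it: one literal SQL string,
    # one cached plan, regardless of how long the list is.
    con.execute("create temp table id_filter (my_id integer)")
    con.executemany("insert into id_filter values (?)", [(i,) for i in ids])

    rows = con.execute("""
        select o.my_id, o.amount
        from orders o
        inner join id_filter f on f.my_id = o.my_id
    """).fetchall()
    print(rows)  # [(1, 10), (3, 30)]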
I am told this has problems in Oracle's OCI when using PDBs: apparently GTTs will shit up the redo-log tablespace (the WAL equivalent). But so do many, many things; PDBs are apparently a much different, much less capable, and incredibly immature implementation that seems essentially abandoned/not a focus, from what I've been told. I was very surprised, having originally written that code on regular Oracle without problems, but PDBs just aren't the same (right down to nulls being present in indexes!).
I used this trick (join temporaryFoo instead of WHERE foo IN ...) in production fifteen years ago, using MySQL. The gain was really astonishing. Several constructs can be optimized using joins on specially crafted tables (LIMIT, for instance).
This is one of the worst drawbacks of ORMs everywhere: nobody even seems to think about those optimisations anymore.
* Reuse the same Spark session throughout the test suite
* Set shuffle partitions to 2 (instead of the default, which is 200)
* Use dependency injection to avoid disk I/O in the test suite
* Use fast DataFrame equality when possible. assertSmallDataFrameEquality is 4x faster than assertLargeDataFrameEquality. Some benchmarks here: https://github.com/MrPowers/spark-fast-tests#why-is-this-lib...
* Use column equality to test column functions. Don't compare DataFrames unless you're testing custom DataFrame transformations. See the spark-style-guide for definitions for these terms: https://github.com/MrPowers/spark-style-guide/blob/main/PYSP...
Spark is an underrated tool for testing SQL. Spark makes it really easy to abstract SQL into unit testable chunks. Configuring your tests properly takes some knowledge, but you can make the tests run relatively quickly.
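For example, a shared-session PyTest fixture along the lines of the first two tips (a sketch; the names are arbitrary):

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # One session for the whole suite, so spin-up cost is paid only once
        return (SparkSession.builder
                .master("local[2]")
                .appName("tests")
                # the default of 200 shuffle partitions is overkill for tiny test data
                .config("spark.sql.shuffle.partitions", "2")
                .getOrCreate())

    def test_uppercase(spark):
        df = spark.createDataFrame([("a",), ("b",)], ["letter"])
        result = df.selectExpr("upper(letter) as letter").collect()
        assert sorted(r.letter for r in result) == ["A", "B"]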
A common table expression, or CTE, is a temporary named result set created from a simple SQL statement that can be used in subsequent SELECT, DELETE, INSERT, or UPDATE statements.
I find this helps a heck of a lot with maintainability + debugging as well
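For instance (a trivial self-contained example, runnable in DuckDB):

    import duckdb

    duckdb.sql("""
        with payments(user_id, amount) as (
            values (1, 60), (1, 50), (2, 10)
        ),
        daily_totals as (   -- the CTE: a named intermediate result
            select user_id, sum(amount) as total
            from payments
            group by user_id
        )
        select user_id from daily_totals where total > 100
    """).show()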
1. Make sure you're using a shared session across the test suite, so that spin-up only has to occur once per suite and not per test. This has the drawback of not allowing dataframe name reuse across tests, but who cares.
2. If you have any kind of magic number N in the Spark SQL or DataFrame calls (e.g. coalesce(N), repartition(N)), parameterize N and set it to 1 for the test (see the sketch after this list).
3. Make sure the master isn't set to any more than 'local[2]'. Or fewer, depending on your workstation.
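A sketch of point 2 (the function and parameter names are invented for illustration):

    from pyspark.sql import DataFrame

    def write_events(df: DataFrame, path: str, num_partitions: int = 200) -> None:
        # Production callers keep the default; tests pass num_partitions=1
        # so tiny test data doesn't get shuffled into hundreds of partitions.
        df.repartition(num_partitions).write.mode("overwrite").parquet(path)

    # in a test: write_events(small_df, str(tmp_path), num_partitions=1)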
Not much more to say, just observing, sorry if this is irrelevant commentary.
The title is "Ask HN: How do you test SQL"
Could you or anyone else on the post provide an example?
If you are storing fully denormalized, concrete data about the salary/medical/retirement amounts, both pre- and post-tax, that were actually paid in each pay period (because this can vary!), then you can define a view that computes salary per employee (taxable, untaxable, etc.), and then a view that rolls up employees per department. And you can write unit tests for all of those.
That's a super contrived example, but basically once group aggregates, window functions, and other complex sub-queries start coming into the picture, it becomes highly desirable to write those as their own views, and you can write some simple unit tests for those views. There are tons of shitty, weird SQL quirks that come from nullity etc., and you can have very specific sum(mycol) filter (where condition) and other non-trivial sub-query logic. It's simple to just write an expression that you think is true and validate that it works like you think: that all output groups (including empty/null groups) that you expect to be present or absent actually are/aren't, etc.
I'm not personally advocating for writing those as CTEs specifically as a design goal in preference to views; I'd rather write views where possible. But recursive CTEs are the canonical approach for certain kinds of queries (particularly node/tree structures), and at minimum a CTE certainly is a "less powerful context" than a generalized WHERE EXISTS (select 1 from ... where myVal = outerVal) or value-select subquery. It's desirable to have that isolation from the outer SQL query cursor imo (and depending on what you're asking, it may optimize to a different plan).
Writing everything as a single query, where the sub-sub-query needs to be 100% sure not to depend on the outer-outer cursor, is painful. What even is DENSE_RANK() in the context of this particular row/window? If you've got some bizarre RANK() over (partition by myId order by timestamp) or whatever, does it really work right? Etc. It's just a lot easier to conceptually write each "function" as a level with its own unit tests. Same as any other unit-testable function block: it's ideal if it's Obviously Correct, and then you define compositions of Obvious Correctness with their own provable correctness.
And if it's not Obviously Correct then you need the proof even more. Encapsulate whatever dumb shit you have to do to make it run correctly and quickly into a subquery, and just do an "inner join ... on outerQuery.myId = myView.myId". Hide the badness.
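A toy version of that layering, runnable against DuckDB (the schema and numbers are invented):

    import duckdb

    con = duckdb.connect()
    con.execute("""create table payments
        (employee_id int, department text, amount int, taxable boolean)""")
    con.execute("""insert into payments values
        (1, 'eng', 100, true), (1, 'eng', 20, false), (2, 'ops', 80, true)""")

    # each "level" is its own view, unit-testable in isolation
    con.execute("""
        create view salary_per_employee as
        select employee_id,
               any_value(department) as department,
               sum(amount) filter (where taxable)     as taxable_total,
               sum(amount) filter (where not taxable) as untaxable_total
        from payments
        group by employee_id
    """)
    con.execute("""
        create view salary_per_department as
        select department, sum(taxable_total) as taxable_total
        from salary_per_employee
        group by department
    """)

    # unit test for the lower-level view
    assert con.execute("""select taxable_total from salary_per_employee
                          where employee_id = 1""").fetchone()[0] == 100
    # unit test for the rollup view
    assert con.execute("""select taxable_total from salary_per_department
                          where department = 'eng'""").fetchone()[0] == 100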
I used this exact same method. Not only does it help me, it helps those who come after trying to understand what's going on.
Every time a new issue cropped up, we would add data to the test dataset designed to reproduce it. Same for every edge case we ran into.
It provided so much confidence because it would catch and trigger so many edge cases that testing with mocks or by hand would miss.
Edit: also, it’s great for test/dev environments. You don’t have to worry about losing important data or filling new environments with data. Just start with the full test data and you’re good to go. It’s got stuff for all the corner cases already. Something got screwed up? Blow it away and reload, it’s not precious.
at my company they just told us to stop reporting edge cases. much easier, much cheaper.
The really great thing about this approach (IMHO), is that you can validate certain constraint violations.
For example, exclusion constraints are great for modelling certain use cases where overlapping ranges should be avoided. In our (Go) code, the test cases can use the SQLSTATE code or the constraint name to figure out if we hit the error we expect to hit.
This approach is pretty much as fast as our unit tests (your mileage may vary), but it prevents way more bugs from being merged into our codebase.
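A sketch of that style of test, translated to Python/psycopg2 (the original is Go; the table, constraint name, and DSN here are invented):

    import psycopg2
    from psycopg2 import errors

    conn = psycopg2.connect("dbname=test")  # hypothetical test database
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("""
            create extension if not exists btree_gist;
            create table booking (
                room int,
                during tsrange,
                constraint booking_no_overlap
                    exclude using gist (room with =, during with &&)
            );
            insert into booking values (1, '[2024-01-01,2024-01-02)');
        """)
        try:
            # overlaps the existing booking, so the constraint must fire
            cur.execute("insert into booking values (1, '[2024-01-01 12:00,2024-01-03)')")
            raise AssertionError("expected an exclusion violation")
        except errors.ExclusionViolation as e:
            assert e.pgcode == "23P01"  # SQLSTATE for exclusion_violation
            assert e.diag.constraint_name == "booking_no_overlap"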
https://www.testcontainers.org/
> Testcontainers for .NET is a library to support tests with throwaway instances of Docker containers for all compatible .NET Standard versions. The library is built on top of the .NET Docker remote API and provides a lightweight implementation to support your test environment in all circumstances.
Edit: Ok, example helps. https://dotnet.testcontainers.org/examples/aspnet/
I can declare docker infrastructure from my code, right?
Our overall strategy is to create a master "test" DB with a test dataset, and for each test, copy this master DB to a test specific DB (CREATE DATABASE <testdb> TEMPLATE <master>) so that tests can run in parallel without interfering with each other and without the significant overhead of a "from scratch" DB initialization.
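A sketch of the per-test copy (Python/psycopg2 for illustration; names invented):

    import uuid
    import psycopg2

    def create_test_db(dsn: str = "dbname=postgres") -> str:
        # Copying a template is much cheaper than a from-scratch init,
        # and each test gets its own isolated database for parallel runs.
        name = f"test_{uuid.uuid4().hex[:8]}"
        conn = psycopg2.connect(dsn)
        conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
        with conn.cursor() as cur:
            cur.execute(f"create database {name} template master")
        conn.close()
        return name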
For schema migrations, we build the "branch/PR" version and the "main" version, then we check that 'init DB with "main" + migration with the "branch/PR" version' results in the same schema as 'init the DB directly with the "branch/PR" version' using apgdiff.
This strategy could probably be extended to migrating from every older version by building each tag, but we don't have that need.
We could also probably improve checks on the data itself however as for now, we only check the schemas.
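Roughly, the schema check looks like this (a sketch using subprocess; database and file names invented):

    import subprocess

    def schema_dump(dbname: str, out: str) -> None:
        subprocess.run(["pg_dump", "--schema-only", "-f", out, dbname], check=True)

    # "migrated": initialized with main's schema, then the PR's migrations applied
    # "fresh":    initialized directly with the PR's schema
    schema_dump("migrated", "migrated.sql")
    schema_dump("fresh", "fresh.sql")

    # apgdiff prints the DDL needed to turn one schema into the other;
    # empty output means both paths converged on the same schema
    diff = subprocess.run(["apgdiff", "migrated.sql", "fresh.sql"],
                          capture_output=True, text=True, check=True)
    assert diff.stdout.strip() == "", diff.stdout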
A few things to note:
* it's still possible to run the tests outside of Docker against a local DB with some light setup (it's faster than running everything in Docker when developing)
* the Docker argument --tmpfs <dir/of/db> is quite good, assuming you have enough RAM for your dataset
* a few configuration tweaks on the DB, like max_connections, might be necessary
Overall, we are quite happy with this setup as it lets us implement end-to-end integration tests quite easily without spending too much time mocking dependencies.
As a general pattern, I find instantiating dependencies internal to your service (like a DB or queue) to be the way to go, with mocking reserved for external dependencies (like external APIs) or, exceptionally, for reaching a specific code branch (especially error-handling sections).
Since OP mentioned DBT (kind of weird; hopefully it's at least DBT2, since DBT is very old), they mean to test the performance of the system rather than the correctness of some queries (typical tools that deal with workloads similar to DBT2 are, e.g., pgbench). Running pgbench over a database set up in a container will tell you nothing, or might just confuse you, if you don't understand the difference between how a database is meant to be set up and what happens in a container.
Took about a week of duplicating happy path tests, investigating gaps, and switching inputs/assertion types to get everything passing, but less than a week later we had our first verifiable test failure.
I think the answers would depend on the types of tests that the term "this" encompasses. As I understand it, calling something a unit test or an integration test depends on the context of what is being tested. For example, if a developer is writing a unit test for an HTTP handler, and the handler implementation includes calls to an external database, then one would have to use a mock for the database in order for the test to be deemed a true unit test [1]. If the developer is writing an integration test for the same HTTP handler, then the database would have to be a real database implementation [2].
On the other hand, if the developer were testing SQL queries for syntactical or logical correctness against a database, these tests would want to use a real database implementation. Note that though the test uses a real database, it is still a true unit test [3]. Additionally, note that using a mocked database here would not serve the purpose of the tests, which is to catch syntactical and logical errors against a database engine. This can, of course, only be achieved by using a real database (or, if you insisted on using a "mock", by implementing an entire SQL engine, with the exact same quirks as the real database's engine, inside the mock!).
On the original question:
> Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?
I guess the answer would be: it depends on the objectives and types of the tests. Do both, because some tests, such as unit tests on the HTTP handler, would use mocks, while other tests, such as the SQL query correctness tests, would use the real database.
[1] A true unit test is one that has no external interactions, neither directly nor transitively, besides interactions with the system under test (SUT). The SUT here is the HTTP handler.
[2] An integration test should include external system interactions, if any. That's what integration means.
[3] The SUT is each SQL query and its interaction with the database. There are no interactions in the test with systems outside the SUT, so it is still a true unit test.
Jokes aside, Redshift is based on Postgres^1; you can try an older version to get some semblance of it running locally.
1. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-an...
I have seen teams give each developer a personal schema on a dev cluster, to ensure their Redshift SQL actually works. The downside is that now your tests are non-local, so it's a real tradeoff. In CI you probably connect to a real test cluster.
Here's [2] a slide deck by David Wheeler giving an introduction into how it works.
[1] https://pgtap.org/
[2] https://www.slideshare.net/justatheory/unit-test-your-databa...
I'm using it w/ supabase, and it works really well.
They have a small doc on it that's a better primer than the pgtap docs: <https://supabase.com/docs/guides/database/extensions/pgtap>.
Pretty easy to get started, I'm doing something like this in a Makefile.
Guide is here: https://supabase.com/docs/guides/database/testing
Plenty of constraints, uniques and foreign keys and not nulls. Enum types.
Visuals: dump to CSV and plot some graphs. It's much easier to find gaps and strange distributions visually.
Asserts in DO blocks, mostly counts being equal (see the sketch after this list).
Build tables in a schema with a _next suffix and swap when done.
Never mutating the source data.
Using psql's ON_ERROR_STOP setting.
Avoid all but the most trivial CTEs, preferring intermediate tables that can be inspected. Constraints and assertions on the intermediate tables.
“Wasting” machine resources and always rebuilding from scratch when feasible. CREATE TABLE foo AS SELECT is much simpler than figuring out which row to UPDATE. Also ensures reproducibility, if you’re always reproducing from scratch it’s always easy. State is hard.
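The DO-block asserts mentioned above look something like this (a plpgsql sketch, shown here run from Python; the schema and table names are invented):

    import psycopg2

    CARDINALITY_CHECK = """
    do $$
    begin
        -- one row per order_id, or abort the build
        assert (select count(*) from sales_next.orders)
             = (select count(distinct order_id) from sales_next.orders),
            'sales_next.orders is not unique per order_id';
    end $$;
    """

    with psycopg2.connect("dbname=warehouse") as conn:  # hypothetical DSN
        conn.cursor().execute(CARDINALITY_CHECK)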
Overall I'm quite happy with the workflow, and very rarely do we make mistakes that unit tests would have caught. Our source data is complex and not always well understood (10+ years of changing business logic), so writing good tests would be very hard. Because we never touch the raw source data, any errors we inevitably make are recoverable.
This talk by Dr Martin Loetzsch helped a lot: https://youtu.be/whwNi21jAm4
I also have a few 'test' queries that insert their results into a 'test_results' table. Most of the queries check the cardinality of a table, since to me wrong cardinality is where the biggest errors come from. I do something like:
    insert into test_results
    select
        case when count(*) = count(distinct users) then 'pass' else 'fail' end as result,
        'test_cardinality_temporary_table_a' as test_name
    from temporary_table_a
You can hook up dbt tests to your CI and Git(hub|lab) for data PRs.
Depending on your needs, you can also look into data observability tools such as Datafold (paid) or re_data (free)
OP (or others): if you've used dbt tests, I'm curious where they fell short? They don't cover everything, but they're pretty good in my experience.
On the back of that professional use I wrote a blog post [2] explaining why you might choose to go down this route, as it wasn't the way database code was developed back then (SQL wasn't developed in the same way as the other front-end and back-end code).
A few years later I gave a short 20-minute talk (videoed) [3] to show what writing SQL using TDD looked like for me. It's hard to show all the kinds of tests we wrote in practice at the bank, but the talk is intended to show how rapid the feedback loop can be using a standard DB query tool and two code windows: production code and tests.
Be kind, it was a long time ago and I'm sure the state of the art has improved a lot in the intervening years :o).
Chris Oldwood
---
[1] SQL Server Unit: https://github.com/chrisoldwood/SS-Unit
[2] You Write Your SQL Unit Tests in SQL?: https://chrisoldwood.blogspot.com/2011/04/you-write-your-sql...
[3] Test-Driven SQL: https://www.youtube.com/watch?v=5-MWYKLM3r0
We use Microsoft SQL's docker image and spin it up in the background on our laptop/CI server so port 1433 has a database.
Then we have our homegrown migration-file runner that computes a hash of the migrations, creates a database template_a5757f7e, and runs the hundreds of migrations on it whenever we add a new SQL migration (todo: make each template build on the previous one).
Then we use the BACKUP command to dump the db to disk (within the docker image)
Finally, each test function is able to make a new database and restore that backup from file in less than a second. Populate with some relevant test data, run code, inspect results, drop database.
So our test suite uses hundreds of fresh databases and it still runs in a reasonable time.
(And... our test suite is written in Go, with a lot of embedded SQL strings, even though a lot of our business logic is in SQL.)