Readit News
snidane commented on Get me out of data hell   ludic.mataroa.blog/blog/g... · Posted by u/pavel_lishin
bob1029 · a year ago
I recently got a bit of a shocked reaction when I proposed to directly load daily files into temporary SQL tables and then use merge commands within the database to load the final tables. My use of code is essentially a shim between an SFTP client and SQL Server in this scenario. Maybe ~200 lines to connect, locate the files, run the bulk load operation, and then invoke the merge commands. Most of the fun bits are in the actual merge scripts.

Once your data is safely inside the database (temporary load tables or otherwise), there really isn't a good excuse for pulling it out and playing a bunch of circus tricks on it. Moving and transforming data within the RDBMS is infinitely more reliable than doing it with external tooling. Your ETL code should be entirely about getting the data safely into the RDBMS. It shouldn't even be responsible for detecting new/deleted/modified records. You really want to use SQL for that.
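As a minimal sketch of that staging-then-merge pattern (sqlite3 here, with its upsert standing in for SQL Server's MERGE; the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, customer TEXT, amount REAL);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
""")

# The merge lives in SQL: new order_ids are inserted, already-seen ones
# are updated in place. (WHERE true resolves a SQLite parsing ambiguity.)
MERGE_SQL = """
    INSERT INTO orders (order_id, customer, amount)
    SELECT order_id, customer, amount FROM staging_orders WHERE true
    ON CONFLICT(order_id) DO UPDATE SET
        customer = excluded.customer,
        amount   = excluded.amount
"""

def load_daily_file(rows):
    """The whole 'shim': bulk-load into staging, merge, truncate staging."""
    con.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)
    con.execute(MERGE_SQL)
    con.execute("DELETE FROM staging_orders")
    con.commit()

load_daily_file([(1, "acme", 10.0), (2, "globex", 25.5)])
load_daily_file([(2, "globex", 30.0), (3, "initech", 4.2)])  # 2 updated, 3 new
```

The Python side stays a dumb loader; all the change-detection logic is in the one upsert statement, which is the part an analyst can read and tweak.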

You'll also be able to recruit more help if everything is neatly contained within the SQL tooling. In my scenario, business analysts can look at the merge commands and quickly iterate on the data pipeline if certain customers have weird quirks. They cannot do the same with some elaborate set of codebases, microservices, etc.

One specific thing that really sold me on this path was seeing how CTEs and views can make the T part of ETL 10000000x easier than even the fanciest code helpers like LINQ.
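A toy illustration of what layered CTEs buy you in the T step (sqlite3, with invented names; each CTE is one readable transformation, composed like a pipeline):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (customer TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("acme", " 10 "), ("acme", "5"), ("globex", "7")])

# Step 1 cleans the raw text fields, step 2 aggregates - each step is
# independently readable and testable by selecting from it directly.
totals = con.execute("""
    WITH cleaned AS (
        SELECT customer, CAST(TRIM(amount) AS REAL) AS amount
        FROM raw_events
    ),
    per_customer AS (
        SELECT customer, SUM(amount) AS total
        FROM cleaned
        GROUP BY customer
    )
    SELECT customer, total FROM per_customer ORDER BY customer
""").fetchall()
```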

snidane · a year ago
The architecture is sound - typically called ELT these days. Dump contents of upstream straight into a database and apply stateless and deterministic operations to achieve the final result tables.

SQL Server is where this breaks down, though. You'll get yelled at by DBAs for bad db practices like storing wide text fields without casting them to varchar(32) or varchar(12), putting primary keys on strings or having no indexes at all, and most importantly taking up the majority of storage on the db host for these raw dumps. SQL Server, like any traditional database, scales by adding machines, so you end up paying compute costs for your storage.

If you use a shared-disk system that decouples compute scaling from storage, then your approach is the way to go. Ideally, these days, dump your files into file storage like s3 and slap a table abstraction over it with some catalog, and now you have 100x lower storage costs and about 5-10x more compute power with things like duckdb. Happy data engineering!

snidane commented on Get me out of data hell   ludic.mataroa.blog/blog/g... · Posted by u/pavel_lishin
snidane · a year ago
Looks like the classic mistake of every data team. Every single office person works with data in one way or another. Having a team called 'data' just hands a blank check to anyone in the organization to dump every issue and every piece of garbage on this team, as long as they can identify it as data.

That's why you build data platforms and name your team accordingly. That is a much easier position to defend, where you and your team have a mandate to build tools for others to be efficient with data.

If upstream provides funky logs or jsons where you expect strings, that's for your downstream to worry about. They need the data, and they need to chase down the right people in the org to resolve it. Your responsibility should only be to provide unified access to that external data, and ideally some governance around the access, like logging and lineage.

tl;dr: Make your 'data' mandate too wide and vague and you won't survive as a team. Build data platforms instead.

snidane commented on Why Americans Stopped Moving   axios.com/2024/09/01/amer... · Posted by u/toomuchtodo
js8 · 2 years ago
Since my childhood in post-1989 Czechoslovakia, neoliberal economists have been touting the supposed benefits of a "flexible" workforce sloshing around the country, with the US often given as an example of how amazingly nomadic it is.

I always thought it was preposterous. Why should people (workers) be uprooted and move, rather than capital (money)? The latter is supposedly much easier to move around.

So, I am glad Americans are finally coming to some common sense.

snidane · 2 years ago
In post-1989 Czechoslovakia, if you moved out of your hometown, you lost valuable networks. Healthcare and childcare were always connection-based - you needed to know the local doctor, teacher, homebuilder - to get connected with specialists, skip the queues, get preferred treatment, faster processing of documents, etc. Once you part with your network, you start building a new one from scratch, which takes many years. Unlike in the US, which is more market-based - as long as you have the money, you can recreate a similar lifestyle elsewhere in the country.

Second, there was never really a need to rush to move to another location offering better opportunities. As a consequence of the 1990s policies, the local capital vanished into the hands of western entities, and with it the opportunities worth moving for. The post-2000 capital that moved to the region just found spots with cheap labor to build new factories or logistics centers to keep the German powerhouse running - with the unfair advantage of cheap eastern energy, cheap eastern workers across the border, and a cheap euro as a result of sharing the currency with the unproductive European south.

snidane commented on Pipe Syntax in SQL   research.google/pubs/sql-... · Posted by u/legrangramgroum
snidane · 2 years ago
This style is familiar to those writing dataframe logic in df libraries with sql semantics - spark, polars or duckdb relational (https://duckdb.org/docs/api/python/relational_api.html).

It definitely makes things easier to follow, but only for linear, i.e. single-table, transformations. The moment joins of multiple tables come into the picture, things get hairy quickly, and then you actually start to appreciate plain old SQL, which accounts for exactly this and lets you specify column aliases throughout the entire CTE clause. With this piping you lose the scope of the table aliases, and then you have to use weird hacks like mangling the names of the joined-in table in polars.

For single-table processing the pipes are nice, though. Especially eliminating the need for multiple different filter keywords based on the order of execution (where, having, qualify - and the pre-join filter, which is missing).

A missed opportunity here is the redundant [AGGREGATE sum(x) GROUP BY y]. Unless you need to specify rollups, [AGGREGATE y, sum(x)] is sufficient syntax for group-bys, and the duckdb folks got it right in the relational api.

snidane commented on DuckDB Community Extensions   duckdb.org/2024/07/05/com... · Posted by u/isaacbrodsky
shubhamjain · 2 years ago
Honest question: how feasible would it be for DuckDB to release a non-columnar version of their DB (or at least make DuckDB a decent choice for a typical web app)? I don't know any other DB that makes installing extensions this easy. The rate at which they're shipping awesome features makes me wonder if they could eventually become a great generic database.

I know, I know, this could just as easily be a double-edged sword. A database should prioritize stability above everything else, but there's no reason we shouldn't expect them to get there.

snidane · 2 years ago
What do you need non-columnar layout for? Do you expect thousands of concurrent single row writes at a time?

If you use embedded duckdb on the client, unless the person goes crazy clicking their mouse at 60 clicks/s, duckdb should handle it fine.

If you run it on the backend and expect concurrent writes, you can buffer the writes in concatenated arrow tables, one per mini-batch, and merge into duckdb every, say, 10 seconds. You'd just need to query both the historical duckdb and the realtime arrow tables separately and combine the results later.
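A rough sketch of that buffering idea, with a plain Python list standing in for the Arrow mini-batches, sqlite3 standing in for duckdb, and an explicit flush() in place of the 10-second timer:

```python
import sqlite3

class BufferedWriter:
    """Buffer incoming rows in memory; merge into the database on flush()."""
    def __init__(self, con):
        self.con = con
        self.buffer = []          # stands in for the in-memory arrow table

    def write(self, row):
        self.buffer.append(row)   # cheap append; no DB write per row

    def flush(self):              # would be called e.g. every 10 seconds
        self.con.executemany("INSERT INTO events VALUES (?, ?)", self.buffer)
        self.con.commit()
        self.buffer.clear()

    def count_events(self):
        # Query historical table and realtime buffer separately, then combine.
        (n,) = self.con.execute("SELECT COUNT(*) FROM events").fetchone()
        return n + len(self.buffer)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts INTEGER, payload TEXT)")
w = BufferedWriter(con)
w.write((1, "a")); w.write((2, "b"))
w.flush()
w.write((3, "c"))                 # still only in the buffer
```

Reads see all three events even though only two have been merged - which is exactly the combine-both-layers step that native support would hide.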

I agree that native support for this so-called Lambda architecture would be cool to have in duckdb. Especially when drinking fast-moving data from a firehose.

snidane commented on SEQUEL: A Structured English Query Language (1974)   dl.acm.org/doi/epdf/10.11... · Posted by u/aragonite
snidane · 2 years ago
Does anybody know what the relation is between SQL, as SEQUEL, and the competing language QUEL from the Ingres db?

It always looked to me as if somebody back then in the database wars was making a word play on the other, one way or another.

https://en.m.wikipedia.org/wiki/QUEL_query_languages

snidane commented on Rejected from YC. Reason: Because I don't have a cofounder   twitter.com/richiemcilroy... · Posted by u/freediver
WhatsName · 2 years ago
Can someone explain the rationale for a cofounder if you are a technical founder? Not sure if there are numbers on this, but I would guess there is an inherent risk associated with conflicts between two founders.
snidane · 2 years ago
Two people already form a 'company', and the hard part of convincing somebody else of your idea has already been accomplished. You can argue that growing from 1 person to 2 is the most critical part of company growth. With 2+ cofounders this critical part is already figured out, making it a much less risky proposition than a single founder who hasn't convinced anybody else to join them yet.
snidane commented on Csvlens: Command line CSV file viewer. Like less but made for CSV   github.com/YS-L/csvlens... · Posted by u/ingve
prudentpomelo · 2 years ago
Awk now supports a `--csv` flag for processing CSVs. https://github.com/onetrueawk/awk/blob/master/README.md
snidane · 2 years ago
How does one output properly CSV-quoted rows? It seems the --csv flag works only for parsing inputs.
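For reference, "properly quoted" here means what Python's csv module produces by default: fields containing delimiters or quote characters get wrapped in quotes, and embedded quotes are doubled:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL: quote only when needed
# A field with a comma and one with an embedded quote both need quoting.
writer.writerow(["plain", "has,comma", 'has"quote'])
line = buf.getvalue().strip()
```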

u/snidane

Karma: 1325 · Cake day: November 10, 2016