That's why you build data platforms and name your team accordingly. This is a much easier position to defend: you and your team have a mandate to build tools for others to be efficient with data.
If upstream provides funky logs or jsons where you expect strings, that's for your downstream to worry about. They need the data, so they need to chase down the right people in the org to resolve it. Your responsibility should only be to provide unified access to that external data, and ideally some governance around the access, like logging and lineage.
tl;dr: Make your 'data' mandate too wide and vague and you won't survive as a team. Build data platforms instead.
I always thought it was preposterous. Why should people (workers) be uprooted and move, rather than capital (money)? The latter is supposedly much easier to move around.
So, I am glad Americans are finally coming to some common sense.
Second, there was never really a need to rush to another location offering better opportunities. As a consequence of the 1990s policies, the local capital vanished into the hands of western entities, and with it the opportunities worth moving for. The post-2000 capital that moved to the region just found spots with cheap labor to build new factories or logistics centers to keep the German powerhouse running, with the unfair advantage of cheap eastern energy, cheap eastern workers across the border, and a cheap euro as a result of sharing the currency with the unproductive European south.
It definitely makes things easier to follow, but only for linear, i.e. single-table, transformations. The moment joins of multiple tables come into the picture, things get hairy quickly, and then you actually start to appreciate plain old SQL, which accounts for exactly this and lets you use column aliases across the entire CTE. With this piping you lose the scope of table aliases, and then you have to resort to hacks like mangling the names of the joined-in table, as in polars.
For single-table processing the pipes are nice, though. Especially eliminating the need for a different filter keyword depending on the stage of execution (WHERE, HAVING, QUALIFY, plus the pre-join filter that SQL is missing).
A missed opportunity here is the redundant [AGGREGATE sum(x) GROUP BY y]. Unless you need to specify rollups, [AGGREGATE y, sum(x)] is a sufficient syntax for group-bys, and the duckdb folks got it right in the relational API.
I know, I know, this could just as easily be a double-edged sword. A database should prioritize stability above everything else, but there is no reason we shouldn't expect them to get there.
If you use embedded duckdb on the client, unless the person goes crazy clicking their mouse at 60 clicks/s, duckdb should handle it fine.
If you run it on the backend and expect concurrent writes, you can buffer the writes in concatenated arrow tables, one per minibatch, and merge them into duckdb every, say, 10 seconds. You'd just need to query the historical duckdb table and the realtime arrow tables separately and combine the results afterwards.
I agree that native support for this so-called Lambda architecture would be cool to have in duckdb. Especially when drinking fast-moving data from a firehose.
It always looked to me as if the players back in the database wars were trying to one-up each other with wordplay, one way or another.
Once your data is safely inside the database (temporary load tables or otherwise), there really isn't a good excuse for pulling it out and playing a bunch of circus tricks on it. Moving and transforming data within the RDBMS is infinitely more reliable than doing it with external tooling. Your ETL code should be entirely about getting the data safely into the RDBMS. It shouldn't even be responsible for detecting new/deleted/modified records. You really want to use SQL for that.
You'll also be able to recruit more help if everything is neatly contained within the SQL tooling. In my scenario, business analysts can look at the merge commands and quickly iterate on the data pipeline if certain customers have weird quirks. They cannot do the same with some elaborate set of codebases, microservices, etc.
One specific thing that really sold me on this path was seeing how CTEs and views can make the T part of ETL 10000000x easier than even the fanciest code helpers like LINQ.
SQL Server is where this breaks down, though. You'll get yelled at by DBAs for bad db practices: storing wide text fields without casting them down to varchar(32) or varchar(12), primary keys on strings, no indexes at all, and most importantly eating the majority of the storage on the db host with these raw dumps. SQL Server, like any traditional database, scales by adding machines, so you end up paying compute costs for your storage.
If you use a shared-disk system where compute scales decoupled from storage, then your approach is the way to go. Ideally, these days, you dump your files into object storage like S3 and slap a table abstraction over it with some catalog, and now you have ~100x lower storage costs and about 5-10x more compute power with things like duckdb. Happy data engineering!