RadiozRadioz · 2 years ago
> And if you’re dealing with truly massive datasets you can take advantage of GPUs for your data jobs.

I don't think scale is the key deciding factor for whether GPUs are applicable for a given dataset.

I don't think this is a particularly insightful article. Read the first paragraph of the "Cost" section.

bradford · 2 years ago
> I don't think this is a particularly insightful article.

Data engineering can be lonely. I like seeing the approach that others are taking, and this article gives me a good idea of the implementation stack.

marcyb5st · 2 years ago
After reading this, I suggest having a look at Apache Beam if you are not using it already. I have the feeling you could achieve the same with far fewer components in the stack.

Also, were you to decide to change infrastructure later, you could run the same pipeline on another "runner".

Additionally, you can truly reuse your Apache Beam logic across streaming and batch jobs. Other tools can perhaps do that too, but from some experiments I ran a while ago it's not as straightforward.

And finally, if one or more of your processing steps need access to GPUs you can request that (granted that your runner supports that: https://beam.apache.org/documentation/runtime/resource-hints... ).
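A plain-Python sketch of the write-once, run-batch-or-streaming idea (this deliberately does not use the Beam API; the function names and data are made up, and in real Beam the shared logic would be a PTransform that any runner executes):

```python
def parse_event(line: str) -> dict:
    """Shared parsing logic, written once."""
    user, amount = line.split(",")
    return {"user": user, "amount": float(amount)}

def total_by_user(events) -> dict:
    """Shared aggregation logic: sum amounts per user."""
    totals: dict = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

# Batch: the functions applied to a complete, bounded dataset.
batch_lines = ["alice,10", "bob,5", "alice,2.5"]
batch_totals = total_by_user(parse_event(l) for l in batch_lines)

# "Streaming": the same functions applied to an unbounded-looking iterator.
def stream():
    for l in ["bob,1", "bob,4"]:
        yield l  # in practice this would be a Kafka/PubSub-style source

stream_totals = total_by_user(parse_event(l) for l in stream())

print(batch_totals)   # {'alice': 12.5, 'bob': 5.0}
print(stream_totals)  # {'bob': 5.0}
```

The point is only that the transform logic lives in one place; swapping the bounded source for an unbounded one doesn't touch it.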

fredguth · 2 years ago
Same here. I just wish OP had pointed to an example repo with a minimal working example.
gchamonlive · 2 years ago
Yes, I also believe both the dataset and the transformation algorithms have to lend themselves well to parallelization for GPUs to be useful. GPUs don't do magic; they are just really good at parallel computing.
winwang · 2 years ago
That's right, and that means most transforms in big data. The fact that the dataset can be distributed at all typically implies that the task is parallel.
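A minimal sketch of why "distributable implies parallel" (data, sizes, and worker counts are made up; threads stand in for the processes or GPU threads a real pipeline would use): a per-row transform has no cross-row dependencies, so chunks can be processed independently and stitched back in order.

```python
from concurrent.futures import ThreadPoolExecutor

def transform_chunk(chunk):
    return [x * 2 + 1 for x in chunk]  # pure per-element work, no cross-row state

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Each chunk can run on any worker in any order; map() preserves result order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [row for part in pool.map(transform_chunk, chunks) for row in part]

print(results[:5])  # [1, 3, 5, 7, 9]
```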
dangoldin · 2 years ago
Author here. There's nuance, but as a rule of thumb data size is a decent enough proxy. The audience here isn't everyone; the goal was to give less experienced data engineers and folks a sense of modern data tools and a possible approach.

But what did you mean by "Read the first paragraph of the `Cost` section"?

llm_trw · 2 years ago
>Author here and there's nuance here but as a rule of thumb data size is a decent enough proxy.

It isn't though.

What matters is the memory footprint of the algorithm during execution.

If you're doing transformations that take constant memory per item regardless of data size, sure, go for a GPU. If memory grows linearly with the data, you can't fit more than 24GB on a desktop card, and prices go to the moon quickly after that.

Junior devs doing the equivalent of an outer product on data is the number one reason I've seen data pipelines explode in production.
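A toy illustration of that failure mode (names and sizes are made up): an all-pairs "outer product" step turns n input rows into n^2 intermediate rows, so memory grows quadratically no matter how big the GPU is, while a streaming aggregate over the same pairs needs only constant memory.

```python
import itertools

users = [f"u{i}" for i in range(1_000)]

# Materializing every pair: 1,000 inputs -> 1,000,000 rows held in memory.
# At 1e8 input rows this would be 1e16 pairs: no card (or cluster) fits that.
pairs = list(itertools.product(users, users))
print(len(pairs))  # 1000000

# Streaming the same pairs while keeping only a running aggregate: the
# generator never materializes the pair set, so memory stays constant.
matches = sum(1 for a, b in itertools.product(users, users) if a == b)
print(matches)  # 1000
```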

zX41ZdbW · 2 years ago
In fact, the opposite is true. While small datasets can be handled on GPU (although there are no good GPU databases comparable in performance to ClickHouse), large datasets don't fit, and unless there is a large amount of computation per byte of data, moving data around will eat the performance.
lmeyerov · 2 years ago
Sort of

Databricks-style architectures mostly spend their time slowly moving data from A to B while doing little work, and it's worse when that describes the distributed compute part too. I read a paper a while back finding that data movement was often half the total time.

GPU architectures end up being explicitly about scaling the bandwidth. One of my favorite projects to do with teams tackling something like using GPU RAPIDS to scale event/log processing is to get GPU direct storage going. Imagine an SSD array reading back at 100GB/s, which feeds 1-2 GPUs at the same rate (PCI cards), then TB/s cross-GPU for loaded data, and a mind-blowing number of FLOPS. Modern GPUs get you 0.5TB+ per-node GPU RAM. So a single GPU node, when you do the IO bandwidth right for such fat streaming, is insane.

So yeah, taking a typical Spark cloud cluster and putting GPUs in vs the above is the difference between drinking from a children's twirly straw vs a firehose.
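Back-of-envelope on the bandwidth claim above (the 100GB/s figure is the comment's; the 2GB/s baseline is an assumed typical per-node object-store read rate, not a measurement): the time to scan a 10 TB event log at each rate.

```python
TB = 1e12  # bytes

def scan_seconds(data_bytes: float, gb_per_s: float) -> float:
    """Time to read data_bytes at a sustained rate of gb_per_s GB/s."""
    return data_bytes / (gb_per_s * 1e9)

cloud_read = scan_seconds(10 * TB, 2)     # assumed cloud object-store rate
gds_read = scan_seconds(10 * TB, 100)     # SSD array via GPU direct storage

print(f"{cloud_read:.0f}s vs {gds_read:.0f}s")  # 5000s vs 100s
```

Same data, same node count; only the IO path changes, which is the "twirly straw vs firehose" gap.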

amadio · 2 years ago
I take issue with this part of the article:

> In general, managed tools will give you stronger governance and access controls compared to open source solutions. For businesses dealing with sensitive data that requires a robust security model, commercial solutions may be worth investing in, as they can provide an added layer of reassurance and a stronger audit trail.

There are definitely open source solutions capable of managing vast amounts of data securely. The storage group at CERN develops EOS (a distributed filesystem based on the XRootD framework), and CERNBox, which puts a nice web interface on top. See https://github.com/xrootd/xrootd and https://github.com/cern-eos/eos for more information. See also https://techweekstorage.web.cern.ch, a recent event we had along with CS3 at CERN.

AnthonyMouse · 2 years ago
Not only that, open source and proprietary software both generally handle the common case well, because otherwise nobody would use it.

It's when you start doing something outside the norm that you notice a difference. Neither of them will be perfect when you're the first person trying to do something with the software, but for proprietary software that's game over, because you can't fix it yourself.

datadrivenangel · 2 years ago
Your options are to use off-the-shelf software and end up with a brittle and janky setup, or use open source and end up with a brittle and janky setup that is more customized to your workflows... It's a tradeoff though, and all the hosting and security work of open source can be a huge time sink.
victor106 · 2 years ago
> Cloudflare R2 (better than AWS S3)

(The article links to [1].)

Is R2 really better than S3?

https://dansdatathoughts.substack.com/p/from-s3-to-r2-an-eco...

dangoldin · 2 years ago
From egress + storage cost standpoint absolutely which ends up being a big factor for these large scale data systems.

There’s a prior discussion on HN about that post: https://news.ycombinator.com/item?id=38118577

And full disclosure but I’m author of both posts - just shifted my writing to be more focused on the company one.
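A rough illustration of why egress dominates the comparison (prices are approximate list-price assumptions, not figures from the article: AWS internet egress is around $0.09/GB at the first tier, while Cloudflare R2 charges no egress fees):

```python
S3_EGRESS_PER_GB = 0.09  # assumed AWS first-tier internet egress price
R2_EGRESS_PER_GB = 0.0   # R2's zero-egress pricing model

def monthly_egress_cost(gb_out: float, price_per_gb: float) -> float:
    return gb_out * price_per_gb

# Serving 50 TB/month out of object storage:
print(monthly_egress_cost(50_000, S3_EGRESS_PER_GB))  # 4500.0
print(monthly_egress_cost(50_000, R2_EGRESS_PER_GB))  # 0.0
```

For read-heavy large-scale data systems, that per-month delta tends to swamp the storage-price difference.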

esafak · 2 years ago
By what metric? They're worth consideration and getting better: https://blog.cloudflare.com/r2-events-gcs-migration-infreque...
esafak · 2 years ago
Can someone explain this "semantic layer" business (cube.dev)? Is it just a signal registry that helps you keep track of and query your ETL pipeline outputs?
dangoldin · 2 years ago
Author here. The basic idea is that you want some way of defining metrics, e.g. “revenue = sum(sales) - sum(discount)” or “retention = whatever”, that are generated via SQL at query time rather than baked into a table. Then you can have higher confidence that multiple access paths all use the same definitions for the metrics.
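A toy sketch of that idea (this is not cube.dev's actual API; the registry, table, and column names are made up): each metric's SQL expression is defined once, and every query path renders it from the same source of truth.

```python
# Single registry of metric definitions, shared by all access paths.
METRICS = {
    "revenue": "SUM(sales) - SUM(discount)",
    "orders": "COUNT(DISTINCT order_id)",
}

def metric_query(metric: str, table: str, group_by: str) -> str:
    """Render a query for a named metric at query time."""
    expr = METRICS[metric]  # one definition, reused everywhere
    return f"SELECT {group_by}, {expr} AS {metric} FROM {table} GROUP BY {group_by}"

sql = metric_query("revenue", "fct_sales", "region")
print(sql)
# SELECT region, SUM(sales) - SUM(discount) AS revenue FROM fct_sales GROUP BY region
```

A real semantic layer adds dimensions, joins, and access control on top, but the core value is the same: a dashboard and an ad-hoc query can't drift apart on what "revenue" means.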
Phlogi · 2 years ago
Why do I need sqlmesh if i use dbt/snowflake?
dangoldin · 2 years ago
You don't need to; dbt and sqlmesh are competing tools. I just prefer the sqlmesh model over dbt's, but dbt is much more dominant.