After reading this, I suggest having a look at Apache Beam if you are not already using it. I have the feeling you could achieve the same result with far fewer pieces in the stack.
Also, it leaves you the option of later running the same pipeline on another "runner".
Additionally, you can truly reuse your Apache Beam logic for streaming and batch jobs. Other tools may be able to do that too, but from some experiments I ran a while ago it's not as straightforward.
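The batch/streaming reuse is Beam's core selling point: the same transform graph runs over a bounded or an unbounded source. A rough illustration of the idea in plain Python (no Beam dependency; `parse_event` and the sources are invented for the sketch — in Beam the same PTransforms would play this role):

```python
# Sketch: one transform pipeline applied to both a bounded (batch)
# and an unbounded-style (streaming) source. Plain generators stand
# in for Beam PCollections here.

def parse_event(line):
    # Hypothetical parser: "user,amount" -> dict
    user, amount = line.split(",")
    return {"user": user, "amount": float(amount)}

def pipeline(source):
    # The "reused logic": parse, filter, project. Works on any
    # iterable, whether a finished file (batch) or a live feed
    # (streaming), because it never assumes the source is finite.
    for line in source:
        event = parse_event(line)
        if event["amount"] > 0:
            yield event["user"]

batch_source = ["alice,10.0", "bob,-2.0", "carol,5.5"]  # bounded
users = list(pipeline(batch_source))
print(users)  # ['alice', 'carol']
```

The same `pipeline` function could consume a socket or queue reader without modification, which is the property Beam formalizes.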
Yes, I also believe both the dataset and the transformation algorithms have to lend themselves well to parallelization for GPUs to be useful. GPUs don't do magic; they are just really good at parallel computing.
That's right, and that means most transforms in big data. The fact that the dataset can be distributed at all typically implies that the task is parallel.
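The point that distributability implies parallelism can be seen in miniature: any pure per-row transform can be run in any order on any worker and merged back, with an identical result. A small sketch (`transform` is a made-up stand-in for a real per-row function):

```python
# Sketch: a per-row transform has no cross-row dependencies, so the
# rows can be processed in any order, on any worker, and merged back.
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    # Any pure function of a single row is trivially parallelizable.
    return row * row

rows = list(range(8))
sequential = [transform(r) for r in rows]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(transform, rows))

assert sequential == parallel  # same result regardless of scheduling
```

Transforms that break this property (joins, windowed aggregations) need shuffles, which is exactly where distributed engines spend their effort.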
Author here. There's nuance to it, but as a rule of thumb data size is a decent enough proxy. The audience here isn't everyone; the goal was to give less experienced data engineers and folks a sense of modern data tools and one possible approach.
But what did you mean by "Read the first paragraph of the `Cost` section"?
>Author here and there's nuance here but as a rule of thumb data size is a decent enough proxy.
It isn't though.
What matters is the memory footprint of the algorithm during execution.
If you're doing transformations that take constant time per item regardless of data size, sure, go for a GPU. If you're doing linear work, you can't fit more than 24 GB on a desktop card, and prices go to the moon quickly after that.
Junior devs doing the equivalent of an outer product on data is the number one reason I've seen data pipelines explode in production.
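Back-of-the-envelope for why the outer-product pattern is lethal: memory grows with n², not n, so a column that fits comfortably stops fitting long before the raw data looks big. A quick sketch (the 24 GB figure is the desktop-card ceiling mentioned above; the 200k-row column is an arbitrary example):

```python
# Sketch: memory footprint of a pairwise (outer-product-style) op.
# Raw column: 200k float32 values -> under 1 MB. All pairs: 160 GB.

n = 200_000
bytes_per_value = 4  # float32

raw = n * bytes_per_value           # linear in data size
pairwise = n * n * bytes_per_value  # quadratic: the trap

print(f"raw column: {raw / 1e6:.1f} MB")       # 0.8 MB
print(f"all pairs:  {pairwise / 1e9:.1f} GB")  # 160.0 GB
print("fits on 24 GB card:", pairwise < 24e9)  # False
```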
In fact, the opposite is true. While small datasets can be handled on a GPU (although there are no good GPU databases comparable in performance to ClickHouse), large datasets don't fit, and unless there is a large amount of computation per byte of data, moving the data around will eat the performance.
Databricks-style architectures are mostly slowly moving data from A to B and doing little work with it, and it's worse when that describes the distributed-compute part too. I read a paper a while back finding that data movement was often half the total time.
GPU architectures end up being explicitly about scaling the bandwidth. One of my favorite projects to do with teams tackling something like using GPU RAPIDS to scale event/log processing is to get GPU Direct Storage going. Imagine an SSD array reading back at 100 GB/s, which feeds 1-2 GPUs at the same rate (PCIe cards), and then TB/s cross-GPU for loaded data, plus a mind-blowing number of FLOPS. Modern GPUs get you 0.5 TB+ of per-node GPU RAM. So a single GPU node, when you get the IO bandwidth right for that kind of fat streaming, is insane.
So yeah, taking a typical Spark cloud cluster and putting GPUs in it vs. the above is the difference between drinking from a children's twirly straw vs. a firehose.
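The bandwidth numbers above are worth making concrete: time-to-scan is just bytes over the slowest link. A quick sketch using the 100 GB/s GPUDirect figure from the comment, with an assumed (not measured) ~2 GB/s per-node object-store read rate for a conventional cluster:

```python
# Sketch: time to scan a dataset is bytes / slowest-link bandwidth.
# 100 GB/s is the GPUDirect Storage figure above; 2 GB/s per node is
# an assumed ballpark for cloud object-store reads, not a benchmark.

dataset_bytes = 10e12   # 10 TB of events/logs

gds_bw = 100e9          # GPUDirect SSD array, bytes/s
node_bw = 2e9           # assumed per-node read bandwidth, bytes/s
nodes = 10

gds_seconds = dataset_bytes / gds_bw
cluster_seconds = dataset_bytes / (node_bw * nodes)

print(f"single GPU node w/ GDS: {gds_seconds:.0f} s")     # 100 s
print(f"10-node cluster:        {cluster_seconds:.0f} s")  # 500 s
```

Under these assumptions the single well-fed GPU node scans the dataset several times faster than the ten-node cluster, which is the twirly-straw-vs-firehose point.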
> In general, managed tools will give you stronger governance and access controls compared to open source solutions. For businesses dealing with sensitive data that requires a robust security model, commercial solutions may be worth investing in, as they can provide an added layer of reassurance and a stronger audit trail.
There are definitely open source solutions capable of managing vast amounts of data securely. The storage group at CERN develops EOS (a distributed filesystem based on the XRootD framework), and CERNBox, which puts a nice web interface on top. See https://github.com/xrootd/xrootd and https://github.com/cern-eos/eos for more information. See also https://techweekstorage.web.cern.ch, a recent event we had along with CS3 at CERN.
Not only that, open source and proprietary software both generally handle the common case well, because otherwise nobody would use it.
It's when you start doing something outside the norm that you notice a difference. Neither of them will be perfect when you're the first person trying to do something with the software, but for proprietary software that's game over, because you can't fix it yourself.
Your options are to use off-the-shelf software and end up with a brittle and janky setup, or use open source and end up with a brittle and janky setup that is more customized to your workflows. It's a tradeoff, though: all the hosting and security work that open source requires can be a huge time sink.
Can someone explain this "semantic layer" business (cube.dev)? Is it just a signal registry that helps you keep track of and query your ETL pipeline outputs?
Author here. The basic idea is that you want some way of defining metrics, something like "revenue = sum(sales) - sum(discount)" or "retention = whatever", which get generated via SQL at query time rather than baked into a table. Then you can have higher confidence that multiple access paths all share the same definitions for the metrics.
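A semantic layer is essentially a registry of metric definitions compiled into SQL at query time, so every dashboard or notebook asking for "revenue" gets the same expression. A toy sketch of the idea (metric names and schema are invented; real tools like cube.dev add joins, caching, and access control on top):

```python
# Sketch: metrics defined once, compiled into SQL at query time.
METRICS = {
    "revenue":   "SUM(sales) - SUM(discount)",
    "avg_order": "SUM(sales) / COUNT(DISTINCT order_id)",
}

def compile_query(metric, table, group_by):
    # Every caller gets the same expression for the same metric name,
    # so dashboards and notebooks can't drift apart.
    expr = METRICS[metric]
    return (f"SELECT {group_by}, {expr} AS {metric} "
            f"FROM {table} GROUP BY {group_by}")

sql = compile_query("revenue", "orders", "region")
print(sql)
# SELECT region, SUM(sales) - SUM(discount) AS revenue
#   FROM orders GROUP BY region
```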
I don't think scale is the key deciding factor for whether GPUs are applicable for a given dataset.
I don't think this is a particularly insightful article. Read the first paragraph of the "Cost" section.
Data engineering can be lonely. I like seeing the approach that others are taking, and this article gives me a good idea of the implementation stack.
And finally, if one or more of your processing steps need access to GPUs you can request that (granted that your runner supports that: https://beam.apache.org/documentation/runtime/resource-hints... ).
Is R2 really better than S3?
https://dansdatathoughts.substack.com/p/from-s3-to-r2-an-eco...
There’s a prior discussion on HN about that post: https://news.ycombinator.com/item?id=38118577
Full disclosure: I'm the author of both posts; I've just shifted my writing to be more focused on the company blog.