Posted by u/bk146 2 years ago
Ask HN: Does (or why does) anyone use MapReduce anymore?
Excluding the Hadoop ecosystem, I see some references to MapReduce in other database and analysis tools (e.g., MATLAB). My perception was that Spark completely superseded MapReduce. Are there just different implementations of MapReduce, and the one that Hadoop implemented was replaced by Spark?
tjhunter · 2 years ago
(2nd user & developer of Spark here.) It depends on what you're asking.

MapReduce the framework is proprietary to Google, and some pipelines are still running inside google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

- users did not care about mapping and reducing, they wanted higher-level primitives like filtering and joins (see the sketch after this list)

- mapreduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do them on top of mapreduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.
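
To illustrate the first point, here's a rough PySpark sketch (the tables and columns are made up): you write filters and joins directly, and the engine decides where the maps and shuffles go.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sketch").getOrCreate()

    # Hypothetical datasets; the point is that filter/join/groupBy are
    # first-class operations, not hand-written map and reduce phases.
    users = spark.read.parquet("/data/users")
    orders = spark.read.parquet("/data/orders")

    result = (users
              .filter(users.country == "DE")   # map-like, no shuffle
              .join(orders, on="user_id")      # implies a shuffle
              .groupBy("user_id")
              .count())                        # reduce-like aggregation
    result.show()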

H8crilA · 2 years ago
There were really only ever Map and Shuffle (Reduce is just Shuffle+Map; another name for Shuffle is GroupByKey). And you see those primitives under the hood of most parallel systems.
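
A toy single-process sketch of that decomposition (plain Python, nothing distributed, word count as the example):

    from collections import defaultdict

    docs = ["a b a", "b c"]

    # Map: emit (key, value) pairs.
    pairs = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle (a.k.a. GroupByKey): gather values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # "Reduce" is then just another Map over the groups.
    counts = {key: sum(values) for key, values in groups.items()}
    print(counts)  # {'a': 2, 'b': 2, 'c': 1}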
refulgentis · 2 years ago
Shuffle is interesting, I gotta read up on that. Maybe I've been hearing "reduce" for too long and have too much of a built-in visual sense of it, but shuffle does not seem like the right name at all: I picture randomizing some set of N elements, where the input and output counts are the same.
lupire · 2 years ago
Reduce is useful for aggregate metrics.
VirusNewbie · 2 years ago
> For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.

Are you confusing Kafka with something else? Kafka is a persistent, append-only queue.

dtoma · 2 years ago
The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

nwsm · 2 years ago
This book looks interesting; should I buy it, or does anyone have newer recommendations? I have Designing Data-Intensive Applications, which is a fantastic overview and still holds up well.
erikerikson · 2 years ago
That was one of "the" books in the space prior to DDIA. In my opinion Akidau mixes the logic for processing events with the stream-infrastructure implementation, because he was writing from the context of his particular use cases. The time that I spoke with him, it seemed that his influence had driven the design of Google's systems and GCP such that they didn't properly prioritize ordering/linearity/consistency requirements. At this point my copy is of historic interest to me.
62951413 · 2 years ago
It has the most interesting/conceptual/detailed discussion of streaming-system semantics (e.g. the interplay of windows and stateful stream operations) that I'm aware of to this day, at least as far as Manning/O'Reilly-level books go. So I'd put it on the same bookshelf as DDIA.

It's a little biased towards Beam and away from Spark/Flink though, which makes it less practical and more conceptual. So as long as that's your cup of tea, go for it.

bk146 · 2 years ago
Thank you, I'll check this out!
throwaway5959 · 2 years ago
I feel like that’s kind of like saying we don’t use Assembly anymore now that we have C. We’ve just built higher level abstractions on top of it.
falcor84 · 2 years ago
Yeah, that's exactly how I read the question, i.e. analogous to "does anyone still code in assembly, or has everyone switched to using abstractions?" and I think it's a very interesting one.
KaiserPro · 2 years ago
It's more like asking whether anyone still uses goto.

Why use map/reduce when you can have an entire DAG for fan-out/fan-in?

dehrmann · 2 years ago
At a high level, most distributed data systems look something like MapReduce, and that's really just fancy divide-and-conquer. It's hard to reason about, and most data at this size is tabular, so you're usually better off using something where you can write a SQL query and let the query engine do the low-level map-reduce work.
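
For instance, a PySpark sketch with an invented events table; the engine compiles the query down to the low-level plan for you:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table; Spark plans the map/shuffle/aggregate steps.
    spark.read.parquet("/data/events").createOrReplaceTempView("events")

    daily = spark.sql("""
        SELECT date, COUNT(*) AS n
        FROM events
        GROUP BY date
    """)
    daily.show()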
BenoitP · 2 years ago
The concept is quite alive, and the fancy deep learning frameworks have it: jax.lax.map, jax.lax.reduce.

It's going to stay because it is useful:

Any operation that you can express with an associative operator is automatically parallelizable. And in both Spark and Torch/JAX this means scalable to a cluster, with the code going to the data. That is the unfair advantage that lets you solve bigger problems.
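
For example, a small JAX sketch; because addition is associative, the backend is free to split the reduction into partial sums and combine them:

    import jax.numpy as jnp
    from jax import lax

    x = jnp.arange(8.0)

    # Associativity is what lets the reduction be computed in parallel
    # and the partial results merged in any grouping.
    total = lax.reduce(x, 0.0, lax.add, dimensions=(0,))

    squares = lax.map(lambda v: v * v, x)  # map is trivially parallel
    print(total, squares)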

If you were talking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).

atombender · 2 years ago
That's my understanding. MR is very simplistic, and it's awkward or impossible to express many problems in it, whereas dataflow processors like Spark and Apache Beam support building complex DAGs from a rich set of operators for grouping, windowing, joining, etc. that you just don't have in MR. You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.
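
A minimal Apache Beam sketch of that richer operator set (toy in-memory data): the pipeline is a DAG of named transforms rather than a single map phase and a single reduce phase.

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.Create([("a", 1), ("b", 2), ("a", 3)])
         | beam.CombinePerKey(sum)  # grouping + combining as one DAG node
         | beam.Map(print))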
danpalmer · 2 years ago
> You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.

atombender · 2 years ago
I don't mean generalization that way. Dataflow operators can be expressed with MR as the underlying primitive, as you say. But MR itself, as described in the original paper at least, only has the two stages, map and reduce; it's not a dataflow system. And it turns out people want dataflow systems, not to hand-code MR and wire up the DAG manually.
eru · 2 years ago
I'm not sure what you describe is the opposite?

I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.

But that doesn't really mean it's useful to think of GOTOs being the generalisation.

Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail call without any arguments. See e.g. https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...
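
A toy Python sketch of that view: each "goto" becomes an argument-less tail call to the next state, trampolined because Python doesn't eliminate tail calls.

    # Each state function "GOTOs" the next state by returning it.
    def state_a():
        print("in a")
        return state_b   # goto b

    def state_b():
        print("in b")
        return None      # halt

    state = state_a
    while state is not None:  # the trampoline drives the tail calls
        state = state()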

dikei · 2 years ago
MapReduce was basically a very verbose/imperative way to perform scalable, larger-than-memory aggregate-by-key operations.

It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.
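
For contrast, a PySpark sketch of the "better abstraction": the aggregate-by-key that once needed a Mapper class, a Reducer class and a job driver becomes one chained expression.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    words = sc.parallelize(["a", "b", "a"])

    # The same scalable, larger-than-memory aggregate-by-key, minus the
    # boilerplate:
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
    print(counts.collect())  # [('a', 2), ('b', 1)] (order may vary)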

lupire · 2 years ago
The abstraction came first. MapReduce was quickly used as a basis for larger-than-machine SQL (Google Dremel and Hadoop Pig). MapReduce was separately useful when the processing pieces require a lot of custom code that doesn't fit well into SQL (because you have hierarchical records, not purely relational, for example)
DeathArrow · 2 years ago
Can you please point to the better abstractions?
willvarfar · 2 years ago
SQL comes to mind.

Every time you run an SQL query on BigQuery, for example, you are executing those same fundamental map shuffle primitives on underlying data, it's just that the interface is very different.
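
You can see that shape in any engine that exposes its query plan. A PySpark sketch (BigQuery's execution details show the same structure):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # A GROUP BY compiles to: partial aggregate (map side), an Exchange
    # (the shuffle), then a final aggregate (reduce side).
    spark.range(1000) \
         .groupBy((col("id") % 10).alias("bucket")) \
         .count() \
         .explain()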

qoega · 2 years ago
Now you rarely use the basic MapReduce primitives directly; you have another layer of abstraction that runs on the infrastructure that was running MR jobs before. This infrastructure can efficiently allocate compute resources to "long"-running tasks in a large cluster, with respect to memory/CPU/network and other constraints. The schedulers of MapReduce jobs and the cluster management tools became that good precisely because the MR methodology had trivial abstractions but required an efficient implementation to make it work seamlessly.

Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole, merging several steps into one when possible and adding combiners (a partial reduce before the shuffle; see the sketch below). This requires the whole processing pipeline to be defined in more specific operations. Some systems propose SQL to formulate the task, but it can be done with other primitives. And given such a pipeline, it is easy to implement optimizations that make the whole system much more user-friendly and efficient than raw MapReduce, where the user has to think about all the optimizations and implement them inside individual map/reduce/(combine) operations.
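
One concrete case of that combiner optimization, as a PySpark sketch: reduceByKey runs a partial reduce on each partition before the shuffle, while groupByKey ships every record across the network first.

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

    # Shuffles every record, then aggregates:
    slow = pairs.groupByKey().mapValues(sum)

    # Partial reduce per partition first (the combiner), so far less
    # data crosses the shuffle:
    fast = pairs.reduceByKey(add)
    print(fast.collect())  # [('a', 3), ('b', 3)] (order may vary)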