Excluding the Hadoop ecosystem, I see some references to MapReduce in other database and analysis tools (e.g., MATLAB). My perception was that Spark completely superseded MapReduce. Are there just different implementations of MapReduce, and the one Hadoop implemented was replaced by Spark?
MapReduce the framework is proprietary to Google, and some pipelines are still running inside Google.
MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:
- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...), as the sketch after this list illustrates
- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do it on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.
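To make the first point concrete, here is a toy, single-process Python sketch of how a higher-level primitive like a join can be expressed on top of map, shuffle and reduce. The helper names (map_phase, shuffle, reduce_phase) are made up for illustration, not any real framework's API:

```python
from collections import defaultdict

# Minimal single-process sketch of the map -> shuffle -> reduce pattern.
# The names here are illustrative only, not any real API.

def map_phase(records, mapper):
    # the mapper turns each record into zero or more (key, value) pairs
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    # group all values by key, as the shuffle step would do across machines
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# A higher-level "join" built from those primitives: tag each side,
# shuffle on the join key, and pair the tagged values in the reducer.
users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

tagged = list(map_phase(users,  lambda r: [(r[0], ("user", r[1]))])) + \
         list(map_phase(orders, lambda r: [(r[0], ("order", r[1]))]))

def join_reducer(key, values):
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    return [(n, i) for n in names for i in items]

print(reduce_phase(shuffle(tagged), join_reducer))
# {1: [('alice', 'book'), ('alice', 'pen')], 2: [('bob', 'mug')]}
```

It works, but nobody wants to hand-roll tagging and reducers for every join, which is why the higher-level APIs won.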
Are you confusing Kafka with something else? Kafka is a persistent, append-only queue.
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
It's a little biased towards Beam and away from Spark/Flink though, which makes it less practical and more conceptual. So as long as that's your cup of tea, go for it.
Why use map/reduce when you can have an entire DAG for fan-out/fan-in?
It's going to stay because it is useful:
Any computation that you can express in terms of an associative operation is automatically parallelizable. In both Spark and Torch/JAX this means it scales to a cluster, with the code going to the data. This is the unfair advantage for solving bigger problems.
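A toy, single-process sketch of why associativity is the key property: an associative operation can be reduced per partition, and the small partial results combined in any grouping without changing the answer, which is exactly what a cluster-wide reduce exploits (illustrative Python only, not Spark or JAX code):

```python
from functools import reduce
import operator

# Because the operation is associative, each partition can be reduced
# independently (on its own core or machine) and only the small partial
# results need to be combined afterwards.

data = list(range(1, 1_000_001))
partitions = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

op = operator.add  # any associative op works here: max, min, set union, ...

partials = [reduce(op, part) for part in partitions]   # per-partition reduce

# combining the partials gives the same result as one big sequential reduce
assert reduce(op, partials) == reduce(op, data)
print(reduce(op, partials))  # 500000500000
```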
If you were talking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).
I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.
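For illustration, here is a rough Python sketch (made-up names, not taken from any real framework) of a couple of those higher-level operators written purely in terms of one per-record map step plus a grouped (shuffled) reduce step:

```python
from collections import defaultdict

# One generic map-then-grouped-reduce step; everything below is built on it.
def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)          # the "shuffle": group values by key
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return [(k, reducer(vs)) for k, vs in groups.items()]

# distinct: key on the record itself and collapse duplicates in the reduce
def distinct(records):
    return [k for k, _ in map_reduce(records, lambda r: [(r, None)], lambda vs: None)]

# per-key count (the classic word count): map to (word, 1), reduce with sum
def count_by_value(records):
    return map_reduce(records, lambda r: [(r, 1)], sum)

words = ["spark", "beam", "spark", "flink", "beam", "spark"]
print(distinct(words))        # ['spark', 'beam', 'flink']
print(count_by_value(words))  # [('spark', 3), ('beam', 2), ('flink', 1)]
```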
I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.
But that doesn't really mean it's useful to think of GOTOs as the generalisation.
Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail call without any arguments. See e.g. https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...
It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.
Every time you run an SQL query on BigQuery, for example, you are executing those same fundamental map and shuffle primitives on the underlying data; it's just that the interface is very different.
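As a rough sketch of that decomposition (a toy in-memory version, not BigQuery's actual execution plan), an aggregate query like SELECT country, SUM(amount) FROM sales GROUP BY country boils down to a map, a shuffle, and a reduce:

```python
from collections import defaultdict

# Toy decomposition of: SELECT country, SUM(amount) FROM sales GROUP BY country
sales = [
    {"country": "DE", "amount": 10},
    {"country": "US", "amount": 25},
    {"country": "DE", "amount": 5},
]

# map: project each row to (grouping key, value to aggregate)
mapped = [(row["country"], row["amount"]) for row in sales]

# shuffle: bring all values for the same key together
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# reduce: apply the aggregate per key
result = {key: sum(values) for key, values in shuffled.items()}
print(result)  # {'DE': 15, 'US': 25}
```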
Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole, merging several steps into one when possible and adding combiners (a partial reduce before the shuffle). That requires the whole processing pipeline to be defined in terms of more specific operations; some systems propose SQL for formulating the task, but it can be done with other primitives as well. Given such a pipeline, these optimizations are easy to implement, which makes the whole system much more user-friendly and efficient than raw MapReduce, where the user has to think about all the optimizations and implement them inside single map/reduce/(combine) operations.
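Here is a toy illustration of what a combiner buys you (plain Python, not any particular engine's implementation): partially reducing on each mapper before the shuffle means far fewer (key, value) pairs have to cross the network, and an optimizer can insert that step automatically once it knows the reduce is associative and commutative:

```python
from collections import Counter

partitions = [
    ["spark", "beam", "spark"],          # output of mapper 1
    ["spark", "flink", "beam", "beam"],  # output of mapper 2
]

# without a combiner, each word is shuffled as its own ("word", 1) pair
naive_pairs_shuffled = sum(len(p) for p in partitions)       # 7 pairs

# with a combiner, each partition ships at most one pair per distinct word
combined = [Counter(p) for p in partitions]                  # partial sums
combined_pairs_shuffled = sum(len(c) for c in combined)      # 5 pairs

# the final reduce merges the partial counts; the result is unchanged
final = sum(combined, Counter())
print(naive_pairs_shuffled, combined_pairs_shuffled)  # 7 5
print(dict(final))  # {'spark': 3, 'beam': 3, 'flink': 1}
```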