hliyan · 2 years ago
I've written this comment before: in 2007, there was a period where I used to run an entire day's worth of trade reconciliations for one of the US's primary stock exchanges on my laptop (I was the on-site engineer). It was a Perl script, and it completed in minutes. A decade later, I watched incredulously as a team tried to spin up a Hadoop cluster (or Spark -- I forget which) over several days, to run a workload an order of magnitude smaller.
jjav · 2 years ago
> over several days, to run a work load an order of magnitude smaller

Here I sit, running a query on a fancy cloud-based tool we pay nontrivial amounts of money for, which takes ~15 minutes.

If I download the data set to a Linux box I can do the query in 3 seconds with grep and awk.

Oh but that is not The Way. So here I sit waiting ~15 minutes every time I need to fine tune and test the query.

Also, of course, the query now has to be written in the vendor's homegrown weird query language, which is lacking a lot of functionality, so whenever I need to do a different transformation or pull the data apart a bit differently, I get to file a feature request and wait a few months for it to be implemented. On the Linux box I could just change my awk parameters a little (or throw Perl into the pipeline for heavier lifting) and be done in a minute. But hey, at least I can put the ticket in a blocked state for a few months while waiting for the vendor.
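The local version is nothing exotic either; roughly something like this (file name, separator, and field numbers all invented here, but you get the idea):

  # filter, aggregate, rank -- the whole "query" in one pipeline
  grep 'ERROR' events.tsv |
    awk -F'\t' '{ count[$5]++ } END { for (k in count) print count[k], k }' |
    sort -rn | head -20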

Why are we doing this?

jdksmdbtbdnmsm · 2 years ago
>Why are we doing this?

someone got promoted

liveoneggs · 2 years ago
yeah but who was getting better stuff on their resume? didn't you get the memo about perl?

Just because your throw-away 40 line script worked from cron for five years without issue doesn't mean that a seven node hadoop cluster didn't come with benefits. You got to write in a language called "pig"! so fun.

jan_Sate · 2 years ago
I still think that it'd be easier to maintain a script that runs on a single computer than to maintain a Hadoop cluster.
asdffdasasdf · 2 years ago
maybe we should all start adding "evaluated a Hadoop cluster for X applications and saved the company $1M a year (in time, headcount, and uptime) by going with a 40-line Perl script"
RcouF1uZ4gsC · 2 years ago
> yeah but who was getting better stuff on their resume? didn't you get the memo about perl?

That is why Rust is so awesome. It still lets me put stuff on my resume, while also producing an executable that runs on my laptop with high performance.

ramon156 · 2 years ago
I'd love to hear what the benefits are of using a framework for the wrong purpose.
jasfi · 2 years ago
There was a time, about 10 years ago, when Hadoop/Spark was on just about every back-end job post out there.
forinti · 2 years ago
People should first try the simplest, most obvious solution, just to have a baseline before they jump to the fancy solutions.
alberth · 2 years ago
I imagine your laptop had an SSD.

People who weren't developing around this time can't appreciate how game-changing SSDs were compared to spinning rust.

I/O was no longer the bottleneck once SSDs arrived.

Even today, people way underestimate the power of NVMe.

zzbn00 · 2 years ago
This was nicely foreseen in the original MapReduce paper, where the authors write:

  > The issues of how to parallelize the computation, distribute the data, and
  > handle failures conspire to obscure the original simple computation with large
  > amounts of complex code to deal with these issues. As a reaction to this
  > complexity, we designed a new abstraction that allows us to express the simple
  > computations we were trying to perform but hides the messy details of
  > parallelization, fault-tolerance, data distribution and load balancing in
  > a library.
If you are not meeting this complexity (and today, with 16 TB of RAM and 192 cores, many jobs don't), then MapReduce/Hadoop is not for you...

dijit · 2 years ago
There is an incentive for people to go horizontal rather than permitting themselves to go vertical.

Makes sense: in university we are told that vertical has limits and that we should prioritise horizontal. But I feel a little like the "mid-wit" meme; once we realise how far vertical can go, we can end up using significantly fewer resources in aggregate (as there is, of course, overhead in distributed systems).

I also think we are disincentivised from going vertical because most cloud providers prioritise splitting workloads; most people don't have 16 TiB of RAM available to them, but they might have a credit card on file with a cloud provider/hyperscaler.

*EDIT*: The largest AWS instance is, I think, the x2iedn.metal, with 128 vCPUs and 4 TiB of RAM.

*EDIT2*: u-24tb1.metal seems larger, at 448 vCPUs and 24 TiB of memory, but I'm not sure you can actually use it for anything that's not SAP HANA.

zzbn00 · 2 years ago
Horizontal scaling did have specific incentives when MapReduce got going, and it still does today in the right parameter space.

For example, I think Dean & Ghemawat describe their incentives reasonably: saving capital by reusing an already-distributed set of machines while conserving network bandwidth. In Table 1 they report that the average job ran for around 10 minutes, involved roughly 150 machines, and that on average 1.2 workers died per job!

The machines had 2-4 GiB of memory, 100-megabit Ethernet, and IDE HDDs. In 2003, when they got MapReduce going, Google's total R&D budget was $90 million. There was no cloud, so if you wanted a large machine you had to pay up front.

What they did with MapReduce is a great achievement.

But I would advise against scaling horizontally right from the start just because we may need to scale horizontally at some point in the future. If it fits on one machine, do it on one machine.

jonstewart · 2 years ago
MapReduce came along at a moment in time when going horizontal was -essential-. Storage had kept increasing faster than CPU and memory, and CPUs in the aughts encountered two significant hitches: the 32-bit to 64-bit transition and the multicore transition. As always, software lagged these hardware transitions; you could put 8 or 16 GB of RAM in a server, but good luck getting Java to use it. So there was a period of several years when the ceiling on vertical scalability was both quite low and absurdly expensive. Meanwhile, hard drives and the internet got big.
oblio · 2 years ago
The problem is that you do want some horizontal scaling regardless, just to avoid SPOFs as much as you can.
edgyquant · 2 years ago
Plus horizontal scaling is sexier
Wonnk13 · 2 years ago
One of my favorite posts. I'll always upvote this. Of course there are use cases one or two standard deviations outside the mean that require truly massive distributed architectures, but not your shitty CSV/JSON files.

Reflecting on a decade in the industry, I can say that cut, sort, uniq, xargs, sed, etc. have taken me farther than any programming language or EC2 instance.
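A typical example, sketched from memory (the paths are made up and the field number assumes a standard access-log layout; adjust to taste):

  # top status codes across a pile of gzipped access logs
  find logs/ -name '*.gz' -print0 |
    xargs -0 zcat |
    cut -d' ' -f9 |
    sort | uniq -c | sort -rn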

_xivi · 2 years ago
Related:

https://news.ycombinator.com/item?id=30595026 - 1 year ago (166 comments)

https://news.ycombinator.com/item?id=22188877 - 3 years ago (253 comments)

https://news.ycombinator.com/item?id=17135841 - 5 years ago (222 comments)

https://news.ycombinator.com/item?id=12472905 - 7 years ago (171 comments)

pvg · 2 years ago
I think you have an off by one year error in these.
donatj · 2 years ago
My work sent me to a Hadoop workshop in 2016 where, in the introduction, the instructor said Hadoop would replace the traditional RDBMS within five years. We went on to build a system to search the full text of Shakespeare for word instances, and it took a solid minute to scan maybe 100k of text. An RDBMS with decent indexes could have done that work instantly; hell, awk | grep | sort | uniq -c could have done that work instantly.
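For a sense of scale, the single-machine version of that workshop exercise is a couple of one-liners (shakespeare.txt and the search word are placeholders):

  # occurrences of one word across the complete works
  grep -oiw 'dagger' shakespeare.txt | wc -l

  # or a frequency table of every word in the corpus
  tr -cs '[:alpha:]' '\n' < shakespeare.txt | tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn | head -20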

It’s been 8 years and I think RDBMS is stronger than ever?

That colored the entire course with a "yeah, right". Frankly, is Hadoop still popular? Sure, it's still around, but I don't hear much about it anymore. I never ended up using it professionally; I do most of my heavy data processing in Go and it works great.

https://twitter.com/donatj/status/740210538320273408

michaelmior · 2 years ago
Hadoop has largely been replaced by Spark, which eliminates a lot of Hadoop's inefficiencies. HDFS is still reasonably popular, but in your use case, running locally would still be much better.
wenc · 2 years ago
Spark is still pretty non-performant.

If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.
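Something like this, give or take (file and column names are made up; assumes the DuckDB CLI is installed, which reads SQL from stdin):

  # single-machine aggregation straight off a CSV export
  echo "SELECT user_id, count(*) AS n
        FROM read_csv_auto('events.csv')
        GROUP BY user_id ORDER BY n DESC LIMIT 20;" | duckdb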

hobs · 2 years ago
In terms of the actual performance? Sure. In terms of the overhead, the mental-model shift, the library changes, the version churn and problems with Scala/Spark libraries, the black-box debugging? No, it's still really inefficient.

Most of the companies I have worked with that actively have Spark deployed are using it on queries over less than 1 TB of data at a time, and boy howdy does it make no sense.

bblaylock · 2 years ago
These posts always remind me of the [Manta Object Storage](https://www.tritondatacenter.com/triton/object-storage) project by Joyent. It was basically object storage combined with the ability to run arbitrary programs against your data in situ. The primary, and key, difference was that you kept the data in place and distributed the program to the data storage nodes (the opposite of most data processing, as I understand it); I think of it as a superpowered version of using [pssh](https://linux.die.net/man/1/pssh) to grep logs across a datacenter. Yet another idea before its time. Luckily, Joyent [open sourced](https://github.com/TritonDataCenter/manta) the work, but the fact that it still hasn't caught on as "The Way" is telling.
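The pssh version, as a rough sketch (hosts file and log path invented), sends the program to the data rather than the other way around:

  # run the grep where the logs live instead of pulling the logs back
  pssh -h hosts.txt -i "zgrep -c 'ERROR' /var/log/app/*.gz"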

Some of the projects I remember from the Joyent team: dumping recordings of local Mario Kart games to Manta and running analytics on the raw video to generate office kart-racer stats; the bog-standard "dump all the logs and map/reduce/grep/count them"; and, I think, one about running mdb postmortems on terabytes of core dumps.

rbanffy · 2 years ago
By similar reasoning, in 2008 or so, I observed that, while our Java app could handle more user requests per second than our Python version, it would take months for the Java app to overtake the Python one in total requests served, because it had to make up for a six-month head start.

Far too often we waste time optimising for problems we don’t have, and, most likely, will never have.

hobos_delight · 2 years ago
…yes, processing 3.2 GB of data will be quicker on a single machine. This is not the scale that Hadoop or any other distributed compute platform is built for.

The reason we use these is for when we have a data set _larger_ than can be handled on a single machine.

ralph84 · 2 years ago
Most people who wasted $millions setting up Hadoop didn’t have data sets larger than could fit on a single machine.
faet · 2 years ago
I've worked places where it would be 1000x harder getting a spare laptop from the IT closet to run some processing than it would be to spend $50k-100k at Azure.
hobos_delight · 2 years ago
I completely agree. I love the tech and have spent a lot of time in it - but come on people, let’s use the right tool for the right job!
saberience · 2 years ago
Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?

I've heard this anecdote on HN before, but without ever seeing actual evidence that it happened, it reads like an old wives' tale, and I'm not sure I believe it.

I've worked on a Hadoop cluster, and setting it up and running it takes quite serious technical skills and experience; those same skills and experience would mean the team wouldn't be doing it unless they needed it.

Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60GB of data? Does that make any sense at all?

hiAndrewQuinn · 2 years ago
Moore's law and its analogues make this harder to back-predict than one might think, though. A decade ago, computers had only about an eighth (a rough upper bound) of the resources modern machines tend to have at similar price points.
OskarS · 2 years ago
This is exactly the point of the article. From the conclusion:

> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.

the8472 · 2 years ago
What can be done on a single machine grows with time, though. You can have terabytes of RAM and petabytes of flash in a single machine now.
MrBuddyCasino · 2 years ago
This will not stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their "Data Lake" via Spark.

And this isn't even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives the loss of institutional knowledge three layoffs down the line.

hobos_delight · 2 years ago
Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume they will need to scale to millions of QPS; most of the time this is incorrect, and when it isn't, the requirements have changed and it needs to be rebuilt anyway.
dagw · 2 years ago
Long-term maintainability is an important point that most comments here ignore. If you need to run the command once or twice every now and then in an ad hoc way, then sure, hack together a command-line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it keep working long after Jeff quits.

Sometimes the killer feature of that data analytics pipeline isn't scalability but robustness, reproducibility, and consistency.