I have rewritten incredibly overarchitected stuff, Cassandra, Hadoop, Kafka, Node, Mongo etc. with a plethora of ‘the latest cool programming languages’ running on big Amazon and Google clusters, as simple, but not sexy, C# and MySQL or PostgreSQL. Despite people commenting on the inefficiency of ORMs and the unscalable nature of the solutions I picked, they easily outperform in every way for the real-world cases these systems were used for. Meaning: far simpler architecture, easier-to-read code, and far better performance in both latency and throughput, even for workloads that will probably never happen. Also: one language, fewer engineers needed, less maintenance and easily swappable databases. I understand that all that other tech is in fact ‘learning new stuff’ for RDD (resume-driven development), but it was costing these companies a lot of money with very little benefit. If I needed something for very high traffic and huge data, I still do not know if I would opt for Cassandra or Hadoop; even with a proper setup, sure they scale, but at what cost? I had far better results with kdb+, which requires very little setup and very minimal overhead if you do it correctly. Then again, we will never have to mine petabytes, so maybe the use case works there: I would love to hear from people who have tried different solutions objectively.
The tree swing cartoon has never been more true [1]
Large companies would benefit from devs not over-engineering simple apps and spending less time on their own tools and ticket systems, and instead spending more time on solving/automating more problems for the company.
[1] https://www.tamingdata.com/wp-content/uploads/2010/07/tree-s...
I have a very similar experience; in particular, on my current project one of the users (a sysadmin, actually) asked if I was using Elasticsearch or something similar, because he noticed that searching for a record and filtering was very fast.
My response: nope, MySQL! (plus an ORM, a much-hated programming language and very few config optimisations on the server side).
This project's DB has a couple of tables with thousands of records, not billions, and, for now, a few users (40-50). A good schema and a few well-done queries can do the trick.
I guess some people are so used to seeing sluggish applications that as soon as they see something that goes faster than average, they think it must use some cool latest big tech.
More than once I've run across situations where everyone is sitting around going "we need to scale up the server or switch the database or rewrite some of this frontend stuff or something, it's so slow and there's nothing we can do", and I've solved their intractable performance problems by just adding an index in MySQL that actually covers the columns they're querying on.
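For illustration, a minimal sketch of that kind of fix from the command line (the table and column names are made up, not from any of the systems above):

    # Check how the slow query is actually executed...
    mysql -e "EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'open';" mydb
    # ...and if it reports a full table scan, add an index covering the filtered columns.
    mysql -e "CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);" mydb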
Lots of people seem to want some silver bullet magical software/technology to solve their problems instead of learning how to use the tools they have. That's not software development.
kdb+ is very cool. Just a single binary, less than 1 MB and yet an extremely fast and powerful tool for analytics. There's a very nice talk on CMU's series of database videos if you want to know more. https://www.youtube.com/watch?v=AiGdfmxEP68
And it scores extremely well on the billion NYC taxi rides benchmark (no cluster, but a seriously big machine): http://tech.marksblogg.com/billion-nyc-taxi-kdb.html
Well, I'm not sure "seniority" is the right word - the more tech stuff you know, in general, the _less_ seniority you're going to achieve in terms of org charts, decent seating, respect and actual pull within an organization. You can achieve job security and higher pay that way, though.
I once converted a simulation from plain old Python into Cython.
Because it then fit in the CPU cache, the speedup was around 10,000x on a single machine (numerical simulations, amirite?).
Because it was so much faster, all the code required to split it up across a bunch of servers in a map-reduce job could be deleted, since it only needed a couple of cores on a single machine for a millisecond or three.
Because it wasn't a map-reduce job, I could take it out of the worker queue and just handle it on the fly during the web request.
Sometimes it's worth it to just step back and experiment a bit.
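(If you haven't tried it, the mechanical part of such a conversion is small; a rough sketch, where sim.pyx and sim.run() are made-up names:)

    pip install cython
    cythonize -i -3 sim.pyx              # compile the .pyx module to a C extension, in place
    python -c "import sim; sim.run()"    # then use it like any other Python module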
Yeah, back when I was in gamedev land and multi-core started coming on the scene it was "Multithread ALL THE THINGS". Shortly thereafter people realized how nasty cache invalidation is when two cores are contending over one cache line. So you can have the same issue show up even in a single-machine scenario.
Good understanding of data access patterns and the right algorithm go a long way in both spaces as well.
Even earlier, when SMP was hitting the server room but still far from the desktop, there was a similar phenomenon of breaking everything down to use ever finer-grain locks ... until the locking overhead (and errors) outweighed any benefit from parallelism. Over time, people learned to think about expected levels of parallelism, contention, etc. and "right size" their locks accordingly.
Computing history's not a circle, but it's damn sure a spiral.
What kind of games? I always thought that e.g. network-synced simulation code in RTSes or MMOs would be extremely amenable to multithreading, since you could just treat it like a cellular automaton: slicing up the board into tiles, assigning each tile to a NUMA node, and having the simulation-tick algorithm output which units or particles have traversed into neighbouring tiles during the tick, such that they'll be pushed as messages to that neighbouring tile's NUMA node before the next tick.
(Obviously this wouldn't work for FPSes, though, since hitscan weapons mess the independence-of-tiles logic all up.)
Hadoop has its time and place. I love using Hive and watching the consumed-CPU counter tick up. When I get our cluster to myself and it steps up by an hour of CPU every second, it's quite a sight to see.
Yeah, I got a bit of a shock when I first used ours and my fifteen-minute query took two months of CPU time. I thought maybe I'd done something wrong until I was assured that was quite normal.
Recently I was sorting a 10-million-line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.
One awk command later my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
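(Roughly something like this, assuming comma-separated input with the numeric field second; filenames are made up:)

    awk -F, 'BEGIN { OFS = "," } { t = $1; $1 = $2; $2 = t; print }' input.csv > flipped.csv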
Morals:
1. Small changes can have a 3+ orders of magnitude effect on performance
2. Use the Google, easier than understanding every tool on a deep enough level to figure this out yourself ;)
Specific command:

    LC_ALL=C sort -n -r filename > output
CSV files are extremely easy to import into Postgres, and 10M rows (assuming they're not very large) isn't much to compute even on a 6- or 7-year-old laptop. Keep it in mind if you've got something slightly more complicated to analyse.
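A rough sketch (database, table and file names are made up; \copy streams the file through the client, so it needs no filesystem access on the server):

    psql -d mydb -c "CREATE TABLE rides (id bigint, fare numeric, pickup_ts timestamp);"
    psql -d mydb -c "\copy rides FROM 'rides.csv' CSV HEADER"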
> The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
Keep in mind it's for graph processing. Hadoop/HDFS still shines for data-intensive streaming workloads like indexing a few hundred terabytes of data, where you can exploit the parallel disk I/O of all the disks in the cluster: if you have 20 machines with 8 disks each, that's 20 * 8 * 100 MB/s = 16 GB/s of throughput; for 200 machines it's 160 GB/s.
However, for iterative calculations like PageRank, the overhead of distributing the problem is often not worth it.
[1] http://www.frankmcsherry.org/assets/COST.pdf
[2] https://news.ycombinator.com/item?id=11855594
For my performance book, I looked at some sample code for converting public transport data in CSV format to an embedded SQLite DB for use on mobile. A little bit of data optimization took the time from 22 minutes to under a second, or ~1000x, for well over 100 MB of source data.
The target data went from almost 200 MB of SQLite to 7 MB of binary that could just be mapped into memory. Oh, and lookup on the device also became 1000x faster.
There is a LOT of that sort of stuff out there; our “standard” approaches are often highly inappropriate for a wide variety of problems.
Normal developer behavior has gone from "optimize everything for machine usage" (cpu time, memory, etc.) to "optimize everything for developer convenience". The former is frequently inappropriate, but the latter is, as well.
(And some would say that it then went to "optimize everything for resume keywords," which is almost always inappropriate, but I don't want to be too cynical.)
iOS used to kill processes above 50 megs of RAM usage. Good times!
The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.
I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.
2016: https://news.ycombinator.com/item?id=12472905
2015: https://news.ycombinator.com/item?id=8908462
2018: https://news.ycombinator.com/item?id=16810756