I have rewritten incredibly overarchitected stuff, Cassandra, Hadoop, Kafka, Node, Mongo etc. with a plethora of ‘the latest cool programming languages’ running on big Amazon and Google clusters, as simple, but not sexy, C# and MySQL or PostgreSQL. Despite people commenting on the inefficiency of ORMs and the unscalable nature of the solutions I picked, they easily outperform in every way for the real-world cases these systems were used for. Meaning: far simpler architecture, easier-to-read code, and far better performance in both latency and throughput, even for workloads that will probably never happen. Also: one language, fewer engineers needed, less maintenance and easily swappable databases. I understand that all that other tech is in fact ‘learning new stuff’ for RDD (resume-driven development), but it was costing these companies a lot of money with very little benefit. If I needed something for very high traffic and huge data, I still do not know if I would opt for Cassandra or Hadoop; even with a proper setup, sure they scale, but at what cost? I had far better results with kdb+, which requires very little setup and very minimal overhead if you do it correctly. Then again, we will never have to mine petabytes, so maybe the use case works there: I would love to hear from people who have tried different solutions objectively.
The tree swing cartoon has never been more true [1]
Large companies would benefit from devs not over-engineering simple apps and spending less time on their own tools and ticket systems, and instead spending more time on solving/automating more problems for the company.
[1] https://www.tamingdata.com/wp-content/uploads/2010/07/tree-s...
I have a very similar experience; in particular, on my current project one of the users (a sysadmin, actually) asked if I was using Elasticsearch or something similar, because he noticed that searching for a record and filtering was very fast.
My response: nope, MySQL! (plus an ORM, a much-hated programming language and very few config optimisations on the server side).
This project's DB has a couple of tables with thousands of records, not billions, and, for now, a few users (40-50). A good schema and a few well-done queries can do the trick.
I guess some people are so used to seeing sluggish applications that as soon as they see something that goes faster than average, they think it must use some cool latest big tech.
More than once I've run across situations where everyone is sitting around going "we need to scale up the server or switch the database or rewrite some of this frontend stuff or something, it's so slow and there's nothing we can do", and I've solved their intractable performance problems by just adding an index in MySQL that actually covers the columns they're querying on.
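For illustration, a minimal sketch of that kind of fix from the command line (the table and column names are made up, not from any of the systems above):

    # Check how the slow query is actually executed...
    mysql -e "EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'open';" mydb
    # ...and if it reports a full table scan, add an index covering the filtered columns.
    mysql -e "CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);" mydb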
Lots of people seem to want some silver bullet magical software/technology to solve their problems instead of learning how to use the tools they have. That's not software development.
kdb+ is very cool. Just a single binary, less than 1 MB and yet an extremely fast and powerful tool for analytics. There's a very nice talk on CMU's series of database videos if you want to know more. https://www.youtube.com/watch?v=AiGdfmxEP68
And it scores extremely well on the billion NYC taxi rides benchmark (no cluster, but a seriously big machine): http://tech.marksblogg.com/billion-nyc-taxi-kdb.html
Well, I'm not sure "seniority" is the right word - the more tech stuff you know, in general, the _less_ seniority you're going to achieve in terms of org charts, decent seating, respect and actual pull within an organization. You can achieve job security and higher pay that way, though.
I once converted a simulation from plain old Python into Cython.
Because it then fit in the CPU cache, the speedup was around 10,000x on a single machine (numerical simulations, amirite?).
Because it was so much faster, all the code required to split it up across a bunch of servers in a map-reduce job could be deleted, since it only needed a couple of cores on a single machine for a millisecond or three.
Because it wasn't a map-reduce job, I could take it out of the worker queue and just handle it on the fly during the web request.
Sometimes it's worth it to just step back and experiment a bit.
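(If you haven't tried it, the mechanical part of such a conversion is small; a rough sketch, where sim.pyx and sim.run() are made-up names:)

    pip install cython
    cythonize -i -3 sim.pyx              # compile the .pyx module to a C extension, in place
    python -c "import sim; sim.run()"    # then use it like any other Python module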
Yeah, back when I was in gamedev land and multi-core started coming on the scene it was "Multithread ALL THE THINGS". Shortly thereafter people realized how nasty cache invalidation is when two cores are contending over one cache line. So you can have the same issue show up even in a single-machine scenario.
Good understanding of data access patterns and the right algorithm go a long way in both spaces as well.
Even earlier, when SMP was hitting the server room but still far from the desktop, there was a similar phenomenon of breaking everything down to use ever finer-grain locks ... until the locking overhead (and errors) outweighed any benefit from parallelism. Over time, people learned to think about expected levels of parallelism, contention, etc. and "right size" their locks accordingly.
Computing history's not a circle, but it's damn sure a spiral.
What kind of games? I always thought that e.g. network-synced simulation code in RTSes or MMOs would be extremely amenable to multithreading, since you could just treat it like a cellular automaton: slicing up the board into tiles, assigning each tile to a NUMA node, and having the simulation-tick algorithm output which units or particles have traversed into neighbouring tiles during the tick, such that they'll be pushed as messages to that neighbouring tile's NUMA node before the next tick.
(Obviously this wouldn't work for FPSes, though, since hitscan weapons mess the independence-of-tiles logic all up.)
Hadoop has its time and place. I love using Hive and watching the consumed-CPU counter tick up. When I get our cluster to myself and it steps up by an hour of CPU every second, it's quite a sight to see.
Yeah, I got a bit of a shock when I first used ours and my fifteen-minute query took two months of CPU time. I thought maybe I'd done something wrong until I was assured that was quite normal.
Recently I was sorting a 10-million-line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.
One awk command later my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
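(Roughly something like this, assuming comma-separated input with the numeric field second; filenames are made up:)

    awk -F, 'BEGIN { OFS = "," } { t = $1; $1 = $2; $2 = t; print }' input.csv > flipped.csv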
Morals:
1. Small changes can have a 3+ orders of magnitude effect on performance
2. Use the Google, easier than understanding every tool on a deep enough level to figure this out yourself ;)
Specific command:

    LC_ALL=C sort -n -r filename > output
CSV files are extremely easy to import into Postgres, and 10M rows (assuming they're not very large) isn't much to compute even on a 6- or 7-year-old laptop. Keep it in mind if you've got something slightly more complicated to analyse.
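A rough sketch (database, table and file names are made up; \copy streams the file through the client, so it needs no filesystem access on the server):

    psql -d mydb -c "CREATE TABLE rides (id bigint, fare numeric, pickup_ts timestamp);"
    psql -d mydb -c "\copy rides FROM 'rides.csv' CSV HEADER"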
> The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
Keep in mind it's for graph processing. Hadoop/HDFS still shines for data-intensive streaming workloads like indexing a few hundred terabytes of data, where you can exploit the parallel disk I/O of all the disks in the cluster: if you have 20 machines with 8 disks each, that's 20 * 8 * 100 MB/s = 16 GB/s of throughput; for 200 machines it's 160 GB/s.
However, for iterative calculations like PageRank, the overhead of distributing the problem is often not worth it.
[1] http://www.frankmcsherry.org/assets/COST.pdf
[2] https://news.ycombinator.com/item?id=11855594
For my performance book, I looked at some sample code for converting public transport data in CSV format to an embedded SQLite DB for use on mobile. A little bit of data optimization took the time from 22 minutes to under a second, or ~1000x, for well over 100 MB of source data.
The target data went from almost 200 MB of SQLite to 7 MB of binary that could just be mapped into memory. Oh, and lookup on the device also became 1000x faster.
There is a LOT of that sort of stuff out there; our “standard” approaches are often highly inappropriate for a wide variety of problems.
Normal developer behavior has gone from "optimize everything for machine usage" (cpu time, memory, etc.) to "optimize everything for developer convenience". The former is frequently inappropriate, but the latter is, as well.
(And some would say that it then went to "optimize everything for resume keywords," which is almost always inappropriate, but I don't want to be too cynical.)
iOS used to kill processes above 50 megs of RAM usage. Good times!
The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.
I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.
2016: https://news.ycombinator.com/item?id=12472905
2015: https://news.ycombinator.com/item?id=8908462
2018: https://news.ycombinator.com/item?id=16810756