Posted by u/night-rider 3 years ago
Ask HN: What Happened to Big Data?
There was a buzzword Big Data which regularly popped up in tech news. I haven’t seen the word used much lately. What advances are being made on it?

I imagine with various privacy scandals it fell out of favour since your data should be /your/ data only.

And many have talked about data being the ‘new oil’ when really it should be reframed as radioactive waste.

What happened to using this term to hype up your brand: ‘We use Big Data to infer information about how to improve and go forward’?

Was it just a hyped up buzzword?

jasode · 3 years ago
To echo what others have said, the practice of big data has become so normalized that the language term "big data" -- as a new thing to call attention to itself -- is not needed as much as before.

A similar language history happened with terms like "dynamic web" or "interactive web". In the late 1990s, when Javascript started to be heavily used, we called that new trend the "dynamic web". Today, the "interactive web" phrase has mostly gone away. But that doesn't mean that Javascript-enabled web pages were a fad. On the contrary, we're using more Javascript than ever before. We just take it as a given that the web is interactive.

Examples of the rise & fall of "dynamic web" and "interactive web" in language use, peaking around 2004:

https://books.google.com/ngrams/graph?content=dynamic+web&ye...

https://books.google.com/ngrams/graph?content=interactive+we...

rg111 · 3 years ago
I am too young to remember "information superhighway".

But even the plain transfer of text files over HTTP or some other protocol was a buzzword once!

pentagrama · 3 years ago
Responsive web design is another one of those.
Cthulhu_ · 3 years ago
Responsive web design, mobile-first design, progressive web apps; there are a lot of phrases that have done the rounds over the past decade or so. Dynamic HTML (DHTML) was a thing for a while too, with early drag-and-drop and other such interactive things on a website.
yen223 · 3 years ago
Remember "multimedia" computers? Those were the days
ghaff · 3 years ago
That's a big part of it. It's also the case that a lot (but not all) of the technologies associated with Big Data were found to be less than broadly useful (many of the NoSQL database technologies, arguably Hadoop, etc.).
lbriner · 3 years ago
Like a lot of fads in IT, Big Data sounded like "if you have a lot of data you can monetize it", so companies threw 7+ figures at the technology and then realised that you can have too much data to know what to do with, and couldn't really monetize it (obviously some did/still do). Even at a simple level, working at a data collection company, it is very clear that lots of people want to collect as much as possible, only to then do precisely nothing useful with the results.

Then Machine Learning came along, and these same people thought it meant you could just feed the beast with your big data and it would be clever enough to tell you what you want to know. Then the same companies realised that they still don't have the skills to work out what to tell the ML algorithm to do.

BeefWellington · 3 years ago
> Then Machine Learning came along, and these same people thought it meant you could just feed the beast with your big data and it would be clever enough to tell you what you want to know. Then the same companies realised that they still don't have the skills to work out what to tell the ML algorithm to do.

My favorite of these was an ML model demo we got, with the incredibly insightful analysis: "As customer dissatisfaction increases, customer satisfaction decreases."

monkeybutton · 3 years ago
The best one I've encountered is: "Customers whose accounts are up for renewal are the ones most at risk of churning."
brainwipe · 3 years ago
Rather than the skills, the difficulty most companies have with ML is that they don't understand their data before they begin. There's a data engineering piece of work required before the data science.
rg111 · 3 years ago
It became ubiquitous.

Now it is in many places. Enterprises use it every moment.

A laptop hard disk is now capable of holding databases with tens of millions of rows.

Traditional "Data Science" and modern Deep Learning rely entirely on it. Millions of datapoints are used to create models everyday.

A sensor on a human wrist collects and stores thousands of data points each day.

So do refrigerators, cars, and your washing machine, with the ubiquity of IoT.

Giant tech cos use billions of rows each day to show users products, or sell their attention as products.

Big Data became ubiquitous. And it became so common that nobody calls it that anymore.

Tools like BigQuery, Dask, and even Pandas and SQL can handle hundreds of thousands to hundreds of millions of rows (or other structures) with normal, everyday programming and commands.
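
For a sense of scale, here is a rough sketch of what that "normal, everyday programming" looks like at that size: ten million synthetic rows aggregated with plain Pandas on an ordinary laptop (hypothetical data, timings will vary by machine):

    import numpy as np
    import pandas as pd

    # Ten million synthetic "events": a user id and a value per row.
    n = 10_000_000
    df = pd.DataFrame({
        "user_id": np.random.randint(0, 100_000, size=n),
        "value": np.random.rand(n),
    })

    # A plain groupby/aggregate over all ten million rows;
    # this finishes in seconds, no cluster required.
    summary = df.groupby("user_id")["value"].agg(["count", "mean", "sum"])
    print(summary.head())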

zasdffaa · 3 years ago
> A laptop hard disk is now capable of holding databases with tens of millions of rows.

> A sensor on a human wrist collects and stores thousands of data points each day.

I feel we have a very different view of what comprises big data :)

rg111 · 3 years ago
Yeah, I work with hundreds of GBs of data every day. I have worked on a dataset with 40 million images in the past. I am also aware that OpenAI, Google AI, etc. train on billions of images, and that internally Amazon, Google, and Meta handle really large datasets.

But that is now.

If you are old enough, you must remember the early years of the big data hype. The bar then was not far from millions or tens of millions of rows.

Re: sensors on Fitbits, I thought everyone would read between the lines and consider that hundreds of thousands of these devices sold every year (every month?) will definitely add up to "big data".

Either these companies are plainly hoarding all of it and running some kind of analysis, or maybe they are doing federated learning. From the companies' standpoint, yes, it is big data.

ironchef · 3 years ago
It's just considered "data" these days. We just look at the Vs of the data and adjust based on those. High velocity? Do X. High volume? Accommodate Y. High variety? ... The other side of things is that the underlying data quality often had tons of issues, so there's been a lot of focus on data observability (which isn't sexy at all).

Still tons of folks out there using Hadoop (ew), Snowflake, etc. New technologies coming out include things like Trino, Apache Iceberg, etc. So it's there ... just no one cares about the moniker .. just getting things done.

marcinzm · 3 years ago
It's simply become the norm. Companies store and analyze lots of data all the time. It's no longer special but simply table stakes. Look at the valuations of Snowflake and Databricks.
jhoechtl · 3 years ago
I disagree. Big data came associated with a new swath of algorithms. Too big to handle? Use new algorithms, maybe not 100% accurate, but they can handle the load. And streaming data as opposed to static data.

There are a lot of approaches like Change Data Capture (CDC) or HyperLogLog - but the norm? Far from it.
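
To give a flavor of what "maybe not 100% accurate but can handle the load" means, here is a toy Flajolet-Martin style distinct-count estimator, the idea that HyperLogLog refines. This is only an illustrative sketch (single hash, high variance), not a production implementation:

    import hashlib

    def trailing_zeros(x: int) -> int:
        """Number of trailing zero bits in x."""
        if x == 0:
            return 64
        n = 0
        while x & 1 == 0:
            x >>= 1
            n += 1
        return n

    def approx_distinct(items) -> float:
        """Flajolet-Martin estimate of the distinct-item count,
        using constant memory regardless of stream size."""
        max_zeros = 0
        for item in items:
            h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
            max_zeros = max(max_zeros, trailing_zeros(h))
        # 0.77351 is the standard Flajolet-Martin correction factor.
        return (2 ** max_zeros) / 0.77351

    # A stream of one million values with ~100,000 distinct ones.
    # The estimate is rough; HyperLogLog averages many buckets to tighten it.
    stream = (i % 100_000 for i in range(1_000_000))
    print(approx_distinct(stream))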

I think the marketing BS fell out of fashion when every database designer became a data scientist, but that's another issue.

marcinzm · 3 years ago
Big Data was about being able to store and query data beyond the limits of a single machine or existing database. The point being to store as much data as you possibly can and then extract value out of it. That grew out of Hadoop/MapReduce which let you cheaply store and access data that doesn't fit into one machine. Streaming was not part of the initial marketing pitch.
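
For anyone who never touched it, the MapReduce programming model itself is tiny; the value was in running each phase across many machines and disks. Here is a toy, single-process word count in Python, purely to illustrate the map/shuffle/reduce shape (not how Hadoop is actually invoked):

    from collections import defaultdict
    from itertools import chain

    def map_phase(document: str):
        # Emit a (key, value) pair for every word in the document.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle_phase(pairs):
        # Group values by key (Hadoop does this step across the network).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Combine all values for one key into a single result.
        return key, sum(values)

    documents = ["big data was a big deal", "now it is just data"]
    pairs = chain.from_iterable(map_phase(doc) for doc in documents)
    counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
    print(counts)  # {'big': 2, 'data': 2, 'was': 1, ...}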

That said if you want to do streaming nowadays then you just integrate with Segment. If you want to track your database then you can dump data using Fivetran. If you want to track client events in excruciating detail then you can use Fullstory/Heap to do so in real time. That's all now table stakes for any company and outsourced to those platforms.

Rastonbury · 3 years ago
I'm non-technical, on the business/strategy side of things at a tech company. I have no idea what any of those things are, but I interact with my company's data lake/warehouse. I don't even know the right term for it, but it's the source of truth for all reports, dashboards, and presentations. I don't know what they used before this, but I imagine it was a bit more painful to use.
rorymalcolm · 3 years ago
Those algorithms and improvements in large-scale data processing got bundled away into a platform/infra layer that a developer or user interfaces with, unaware of what's going on in the background to produce the results they want.
aeyes · 3 years ago
Most companies probably realized that they don't have Big Data problems, because they only have a limited amount of data, which you can actually process in an acceptable amount of time on a single Postgres instance. Distributed data processing has a huge upfront tax, and you really only want to be doing it if the data set is enormous.

I guess it is similar to other technologies which most companies or developers would really never need due to their limited scale, like distributed databases, NoSQL, or microservices: it is interesting technology, and engineers would like to get their hands on it because that's what the big boys play with, even if they don't really need it. In the meantime, the industry hypes it because the technology is difficult, so they know they can make money doing consulting.

I'm not saying that it is not useful technology; I work at a company where we had the need to go from Postgres to "Big Data" tooling. But for tons of businesses it just doesn't make any sense. And even in our case, one of the questions I ask most frequently is: what business decision are you taking based on processing this enormous amount of data? Can we not take the same decision based on less data?

SOLAR_FIELDS · 3 years ago
It’s been the same experience for me. Until you get into petabyte levels of data, a single replicated and vertically scaled Postgres is probably going to be just fine. Quite a few orgs probably realized that eventually, after sinking $MM into some Hadoop/Spark setup in the mid-2010s.

If what you’re doing is

1. Easily parallelizable

2. CPU intensive

3. In 10’s of petabytes or more

Then one of these machine-gun-like setups makes sense in 2022. Otherwise YAGNI (you aren't gonna need it).

ryanjkirk · 3 years ago
"For Basecamp, we take a somewhat different path. We are on record about our feelings about sharding. We prefer to use hardware to scale our databases as long as we can, in order to defer the complexity that is involved in partitioning them as long as possible — with any luck, indefinitely."

— Mark Imbriaco, 37Signals

These quotes are from 2009 and 2010, and yet here most of us are in 2022, having learned the lesson the hard way over the last decade that there is no refuting this simple logic. I'll add my own truism: All else being equal, designing and maintaining simpler systems will always cost less than complex ones.

quote references:

https://signalvnoise.com/posts/2479-nuts-bolts-database-serv...

http://37signals.com/svn/posts/1509-mr-moore-gets-to-punt-on...

benjaminwootton · 3 years ago
In addition to the skeptical comments, I think infrastructure and best practice also caught up such that what used to be big data is not so big anymore.

Storing data on S3 or using BigQuery removes a lot of the challenges, as opposed to doing this stuff in the data centre. You then also have services such as EMR, Databricks, and Snowflake to acquire the tooling and platforms as IaaS/SaaS. The actual work then moves up the stack.

Businesses are doing more with data than ever before and the volumes are growing. I just think the challenge moved on from managing large datasets as a result of new tooling, infrastructure, and practices.

gregw2 · 3 years ago
If you used Big Data tech (Hadoop/YARN/Spark) when you didn't actually have Big Data (petabytes), it was slower than columnar databases, so the shine wore off.
beckingz · 3 years ago
This. The falling cost of storage and the increasing speed of SSDs mean that for most use cases a column-store database is significantly faster and cheaper.

Plus people started wising up to COST ("Configuration that Outperforms a Single Thread").

BeefWellington · 3 years ago
Yeah. Companies don't like it when their expensive fancy new Hive/Hadoop cluster takes longer than their existing Oracle or SQL Server DB to run even a moderately complex query that's core to their business.

For some reason there's ridiculous levels of FOMO in executive ranks, so any new trend is something they need to jump on like it will be what keeps their company around in 10 years. The result of this is fad-jumping, which I've seen happen from Big Data to ML to Blockchain, costing companies millions that could have been better invested in their own products or offerings and actually competing better. It's a really expensive educational cost for leadership IMO.