ckelly · 3 years ago
It has always surprised me that many technology professionals (and business professionals in general) don't have a strong intuition for the power of sampling. For example, in this case, the author states: "With 100 samples, our estimates are accurate to within about 5%. The magic of sampling is that we can derive accurate estimates about a very large population using a relatively small number of samples. In the last scenario (100 billion M&Ms), we have 1% accuracy despite only sampling 0.00001% of the M&Ms."

I bet many would think n=100 would be worthless once the population reaches millions, or especially billions.

One HN-related piece of evidence for that is when I pointed out what margin of error would be for a n=164 survey sample, I got downvoted hard! https://news.ycombinator.com/item?id=8050801

But I saw this hundreds of times talking to customers when I ran a survey sampling product out of YC.
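For reference, the usual 95% margin of error for a sample proportion depends only on the sample size, never on the population size. A quick Python sketch (under the simple-random-sampling assumption, worst case at p=0.5):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a sample proportion.
    Note that the population size never appears in the formula."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 164, 1000):
    print(f"n={n}: ±{margin_of_error(n):.1%}")
```

So n=164 already gets you within about ±8 points whether the population is a thousand or a billion.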

travisjungroth · 3 years ago
The impact of sample size completely defies human intuition. I give approximately zero weight to any opinion about a sample size that doesn’t come with the formula, except for some coworkers talking about specific topics where they’ve run the numbers so many times.

The problem is that power scales with the square of the sample size relationship (halving your error bars takes four times the data), and then effect size and variance get involved too. Quadratic relationships are unintuitive, many-orders-of-magnitude differences are unintuitive, and variance is unintuitive. So it's like a superformula for the human brain to not guess accurately.
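That square relationship can be sketched with the standard normal-approximation sample-size formula for a two-sample comparison (illustrative defaults: 5% two-sided significance and 80% power, i.e. z-values 1.96 and 0.84):

```python
import math

def n_per_group(effect, sd, z_alpha=1.96, z_power=0.84):
    """Approximate n per group for a two-sample comparison of means
    (normal approximation; 5% two-sided alpha, 80% power by default)."""
    return math.ceil(2 * (z_alpha + z_power) ** 2 * (sd / effect) ** 2)

# Halving the detectable effect quadruples the required sample size:
for effect in (1.0, 0.5, 0.25):
    print(f"effect {effect}: n per group ≈ {n_per_group(effect, sd=1.0)}")
```

This is why "millions of samples" can still be underpowered: a tiny effect relative to the variance blows the required n up quadratically.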

I write experimentation software, and with samples in the millions, data scientists still want more power. Then you run some internal experiment with n=20 and it’s like “oh yeah, super significant”.

rcxdude · 3 years ago
I think the issue is that it isn't really taught: in science class and in popular science it's often mentioned that you should be on the lookout for small sample sizes in studies as a measure of quality, but it doesn't really go into much more detail than that. In my mathematics education I wasn't taught about this effect until university (despite doing a lot of stats exercises involving sampling). And sample size is pretty much always an easy thing to pull out of a paper without much familiarity with the subject of the paper, while sample bias (a much bigger issue) can be much harder to assess (though at least some people seem to be getting the idea that if you are sampling from university students, like many papers do, it's not exactly representative of the general population).
dietrichepp · 3 years ago
Statistics has got to have the highest density of paradoxes and unintuitive results. Math and quantum mechanics seem like they have plenty of paradoxes, but then when I studied statistics, it felt like I was blindsided by new paradoxes every month for a year.
lodi · 3 years ago
I was thinking about this last week[1] and I think that both the "math" people and "common sense" people are correct in a sense, and talking past each other. The math people are of course mathematically correct within the limits of the constructed model, given all the assumptions of perfectly random sampling, no systematic error, etc. Meanwhile common sense people are correct in a practical sense: small samples are vulnerable to sampling error, p-hacking, outright fraud, etc. Even before you're aware of the exact mechanisms by which things can go wrong, you intuitively know that small/cheap studies are more vulnerable to some kind of honest human error or dishonest... shenanigans.

---

[1] I was listening to a podcast where trolley problems were brought up and the speaker was lamenting how clearly "unethical" and "irrational" your evolved intuition is, given that most people would rather let the train hit 10 men working on the tracks than divert it and kill 1 innocent. Trolley problems are intellectually interesting for various reasons, but jumping to that conclusion is clearly absurd. Your intuitions are shaped by millions of years of genetic and social evolution precisely to be most rational for actual real-life problems. If you were actually standing at that switch you'd be thinking...

* do I actually trust my eyes in this situation? Are the workers on a parallel track and there's no actual problem here?

* if I pull the switch, will it derail the train and kill N+1 people instead of the 10?

* will the workers just notice the train in time and scurry off the track? Or will the train just stop? How good are brakes on a train anyway?

* how much time do judges and juries spend solving trolley problems?

... and while you were paralyzed thinking about these and a million other things, whatever was about to happen would happen and there would be no trolley problem.

taeric · 3 years ago
Yeah, I confess I'm having a hard time understanding this. How much of the underlying population do you have to accurately know for such a small sample to be worth so much?

Edit: I see that the article describes some of the limitations. I'm curious on how to work with unknown populations. That said, it does have me revisiting some ideas. Looking forward to it.

sfifs · 3 years ago
For people who can write code, the simplest exercise to convince yourself of foundational statistics is simulations.

Create a simulated population with some distribution of a metric and run multiple sampling simulations. You'll be surprised. You can even put in sampling biases and test the impact.
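A minimal version of that exercise, using only the standard library (the population here is hypothetical: a million M&Ms, 30% of them blue):

```python
import random
import statistics

random.seed(0)
# Simulated population: a million M&Ms, 30% of them blue
population = [random.random() < 0.30 for _ in range(1_000_000)]

# Draw many independent samples of just 100 and record the estimation error
errors = [abs(statistics.fmean(random.sample(population, 100)) - 0.30)
          for _ in range(500)]
print(f"median absolute error: {statistics.median(errors):.3f}")
```

The typical error lands around three percentage points, matching the article's claim; swapping `random.sample` for a deliberately biased selection rule shows how quickly that guarantee evaporates.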

Monte Carlo simulations are a surprisingly powerful tool. I once discovered that FAANG data scientists were misunderstanding statistical significance in a reporting product they made by half an order of magnitude, because they didn't understand the impact of observational methodology and sampling bias in their product. In my company, we set our own thresholds much larger than what the product recommended.

CrazyStat · 3 years ago
> How much of the underlying population do you have to accurately know for such a small sample to be worth so much?

That's the magic of random sampling: you don't need to accurately know anything about the underlying population. If you do know things about the underlying population then you can do clever things like stratified sampling to get even more accurate measurements, but that's not necessary. The magic is that a randomly selected group of 100/1000/10000 is unlikely to be too different from the population as a whole, no matter what that population looks like.

You do have to be able to sample randomly--truly randomly[1]--from the population, though, and that's often an issue. Picking 100 people randomly from the population of "likely voters in the next US presidential election" is a very nontrivial thing. To start with, that population is not even very well defined; who is likely to vote changes over time and is difficult to pin down. Pollsters do various things to try to account for this, but if they fail to predict say a surge in young voters their numbers will end up being off.

Even if the population is clearly defined, it's not easy to survey a truly random sample from it. Some people are hard to reach. Some people don't want to talk to you, and whether or not they're willing to talk to you might be correlated with the thing you're interested in (like who they plan to vote for). You can do things to try to correct for that, but again if you get that wrong (and it's very hard to get right) your estimates will be off.

And of course, if you're interested in things that are rare, like third party voters, you need a much larger sample to get an accurate read. If you sample 100 likely voters there's a pretty good chance you won't get a single person who plans to vote for the Libertarian Party candidate.

[1] For the most basic form of random sampling, simple random sampling, you need not just every individual in the population to have the same probability of getting sampled, but every possible sample (i.e. every possible set of 100) needs to have the same probability of being sampled.
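To put a number on that last point (the 1% figure below is illustrative, not a real poll result):

```python
# Probability that a simple random sample of 100 likely voters contains
# zero supporters of a candidate polling at 1%: (1 - p) ** n
p_none = (1 - 0.01) ** 100
print(f"chance of zero in sample: {p_none:.1%}")
```

Roughly a one-in-three chance of missing the candidate entirely, which is why rare subgroups need much larger samples.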

cousin_it · 3 years ago
Maybe as a way to rewire your intuition a bit, imagine that the population is infinite. A smooth curve. Yet if you sample 100 random points on the curve, you'll know quite a lot about it.
kevin_thibedeau · 3 years ago
> I bet many would think n=100 would be worthless once the population reaches millions, or especially billions.

That depends on a uniform distribution of the population and an unbiased sampling method. One of the polls in the 2016 US presidential election [1] would shift Trump's position by a full percentage point based on input from a single man depending on which week he participated in the panel.

[1] https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-ol...

rcxdude · 3 years ago
Yes, but if your sampling is biased then the solution is not simply a larger sample size. The only way in which larger sample sizes help is giving more power to any methods you use to try to control for sampling bias.
Dylan16807 · 3 years ago
> a full percentage point

If your goal is to be within 10%, that's not a problem.

bigmattystyles · 3 years ago
Followed the link thinking I would get a refresher on Nyquist frequency, but they meant another type of sampling. If, like me, you were seeking a refresher: https://en.wikipedia.org/wiki/Nyquist_frequency
karmakaze · 3 years ago
I thought so too, though I'm quite interested in both types.

At one time long ago I thought we can't just rasterize everything; there must be cases where, or as resolutions increase, it won't scale well/make sense. That didn't turn out to be the case. Another story: at one company I was at, they were trying to make raster-driven PCB masks (I think that was how it worked, I was in a different dept). The traces must be continuous and robust, so it's normally done pen-plotter style. They were able to keep increasing the resolution of the laser rasterizing process to the point where it no longer mattered.

I think now the default view is to work in pixels, voxels, etc. For all we know, the universe could work that way too. Like what's with dark energy? Space being created by a vacuum.

__mharrison__ · 3 years ago
Big data hates this article...
jandalf · 3 years ago
I have a question regarding the direct computation of the error sizes via the inverse error function. Does this not assume that the sample error is normally distributed, whereas it is hypergeometric (binomial in approximation)? Does anybody have a link which explains this way of error estimation in detail? I would have used an exact confidence interval + its half width, but I am no expert in statistics.
6gvONxR4sf7o · 3 years ago
The central limit theorem comes into play here. The average of a thing quickly becomes Gaussian in a huge variety of circumstances.
ericpauley · 3 years ago
The binomial distribution quickly converges towards the normal distribution for large n, so this relaxation is pretty commonly used.
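A quick sketch of that convergence, comparing the exact binomial CDF against the continuity-corrected normal approximation (the n=1000, p=0.3 parameters are illustrative):

```python
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact Binomial(n, p) CDF via the pmf ratio recurrence
    (avoids computing huge binomial coefficients directly)."""
    pmf, total = (1 - p) ** n, 0.0
    for i in range(k + 1):
        total += pmf
        pmf *= (n - i) / (i + 1) * p / (1 - p)
    return total

n, p, k = 1000, 0.3, 310
exact = binom_cdf(k, n, p)
# Normal approximation with continuity correction
approx = NormalDist(n * p, (n * p * (1 - p)) ** 0.5).cdf(k + 0.5)
print(f"exact {exact:.4f} vs normal {approx:.4f}")
```

At this n the two already agree to a few decimal places, which is why the relaxation is so routine in practice.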