simsla · 7 months ago
This relates to one of my biggest pet peeves.

People interpret "statistically significant" to mean "notable"/"meaningful". I detected a difference, and statistics say that it matters. That's the wrong way to think about things.

Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".

Whether the measured difference is significant in the sense of "meaningful" is a value judgement that we / stakeholders should impose on top of that, usually based on the magnitude of the measured difference, not the statistical significance.

It sounds obvious, but this is one of the most common fallacies I observe in industry and a lot of science.

For example: "This intervention causes an uplift in [metric] with p<0.001. High statistical significance! The uplift: 0.000001%." Meaningful? Probably not.

mustaphah · 7 months ago
You're spot on that significant ≠ meaningful effect. But I'd push back slightly on the example. A very low p-value doesn't always imply a meaningful effect, but it's not independent of effect size either. A p-value comes from a test statistic that's basically:

(effect size) / (noise / sqrt(n))

Note that a bigger test statistic means a smaller p-value.

So very low p-values usually come from bigger effects or from very large sample sizes (n). That's why you can technically get p<0.001 with a microscopic effect, but only if you have astronomical sample sizes. In most empirical studies, though, p<0.001 does suggest the effect is going to be large because there are practical limits on the sample size.
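The tradeoff can be sketched numerically. Below is a plain-Python one-sample z-test with made-up numbers (effect 0.0001, unit noise): the microscopic effect only reaches p<0.001 at an astronomical n.

```python
import math

def z_test_p(effect, noise_sd, n):
    """Two-sided p-value for a z-test: z = effect / (noise / sqrt(n))."""
    z = effect / (noise_sd / math.sqrt(n))
    # Two-sided tail probability of the standard normal, via erfc.
    return math.erfc(abs(z) / math.sqrt(2))

effect, noise = 0.0001, 1.0  # microscopic effect, unit noise (illustrative numbers)

print(z_test_p(effect, noise, 10_000))         # modest n: p ~ 0.99, nowhere near significant
print(z_test_p(effect, noise, 2_000_000_000))  # astronomical n: p < 0.001
```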

specproc · 7 months ago
The challenge is that datasets are just much bigger now. These tools grew up in a world where n=2000 was considered pretty solid. I do a lot of work with social science types, and that's still a decent sized survey.

I'm regularly working with datasets in the hundreds of thousands to millions, and that's small fry compared with what's out there.

The use of regression, for me at least, is not getting that p-gotcha for a paper, but as a posh pivot table that accounts for all the variables at once.

pebbly_bread · 7 months ago
Depending on the nature of the study, there are lots of scientific disciplines where it's trivial to get populations in the millions. I got to see a fresh new student's poster where they had a p-value in the range of 10^-146 because every cell in their experiment was counted as its own sample.
amelius · 7 months ago
https://pmc.ncbi.nlm.nih.gov/articles/PMC3444174/

> Using Effect Size—or Why the P Value Is Not Enough

> Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude –not just, does a treatment affect people, but how much does it affect them.

– Gene V. Glass

tryitnow · 7 months ago
Agreed. However, I think you're being overly charitable in calling it a "pet peeve"; it's more like a pathological misunderstanding of stats that leads to a lot of bad outcomes, especially in popular wellness media.

As an example, read just about any health or nutrition research article referenced in popular media and there's very often a pretty weak effect size even though they've achieved "statistical significance." People then end up making big changes to their lifestyles and habits based on research that really does not justify those changes.

jpcompartir · 7 months ago
^

And if we increase N enough we will be able to find these 'good measurements' and 'statistically significant differences' everywhere.

Worse still if we did not agree in advance what hypotheses we were testing, and go looking back through historical data to find 'statistically significant' correlations.
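That dredging effect is easy to simulate. In this stdlib-only sketch every hypothesis is a true null by construction, yet roughly 5% of them come out "significant" at the 0.05 level:

```python
import math
import random

random.seed(0)

def null_experiment(n=100):
    """Simulate a null effect: mean of pure noise, return its two-sided p-value."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(xs) / n
    z = mean / (1 / math.sqrt(n))  # standard error of the mean is 1/sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# Dredge 1000 hypotheses that are all false by construction...
pvals = [null_experiment() for _ in range(1000)]
hits = sum(p < 0.05 for p in pvals)
print(f"{hits} 'significant' findings out of 1000 true nulls")  # expect ~50
```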

ants_everywhere · 7 months ago
Which means that statistical significance is really a measure of whether N is big enough
V__ · 7 months ago
I really like this video [1] from 3blue1brown, where he proposes to think about significance as a way to update the probability. One positive test (or in this analogy, a study) updates the probability by X%, and thus you nearly always need more tests (or studies) for a 'meaningful' judgment.

[1] https://www.youtube.com/watch?v=lG4VkPoG3ko
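The update being described can be sketched with Bayes' rule. The sensitivity and false-positive rates below are hypothetical (not taken from the video); the point is that one positive study moves a skeptical prior only modestly, while several studies compound.

```python
def bayes_update(prior, true_pos=0.9, false_pos=0.05):
    """Posterior P(hypothesis is real) after one positive result (hypothetical rates)."""
    num = true_pos * prior
    return num / (num + false_pos * (1 - prior))

p = 0.01  # skeptical prior that the effect is real
for study in range(1, 4):
    p = bayes_update(p)
    print(f"after positive study {study}: P(real) = {p:.3f}")
```

With these (made-up) rates, one positive study only lifts a 1% prior to roughly 15%; it takes a few replications before the posterior is convincing.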

kqr · 7 months ago
To add nuance, it is not that bad. Given reasonable levels of statistical power, experiments cannot show meaningless effect sizes with statistical significance. Of course, some people design experiments at power levels way beyond what's useful, and this is perhaps even more true when it comes to things where big data is available (like website analytics), but I would argue the problem is the unreasonable power level, rather than a problem with statistical significance itself.

When wielded correctly, statistical significance is a useful guide both to what's a real signal worth further investigation, and it filters out meaningless effect sizes.

A bigger problem even when statistical significance is used right is publication bias. If, out of 100 experiments, we only get to see the 7 that were significant, we already have a false:true ratio of 5:2 in the results we see – even though all are presented as true.
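The 5:2 arithmetic follows from a field of mostly-null experiments filtered at alpha = 0.05. This sketch assumes, illustratively, 2 real effects that are both detected; the numbers are not from any real study.

```python
# Hedged arithmetic behind the 5:2 false:true ratio (illustrative numbers).
experiments = 100
alpha = 0.05
true_effects = 2  # assume 2 real effects, both detected (good power)
false_positives = (experiments - true_effects) * alpha  # ~4.9 nulls slip through
published = true_effects + false_positives              # only these are visible
print(f"~{false_positives:.0f} false : {true_effects} true among ~{published:.0f} published")
```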

ants_everywhere · 7 months ago
> Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".

Significance does not tell you this. The p-value can be arbitrarily close to 0 while the probability of the null hypothesis being true is simultaneously arbitrarily close to one.

wat10000 · 7 months ago
Right. The meaning of p-value is, in a world where there is no effect, what is the probability of getting the result you got purely by random chance? It doesn’t directly tell you anything about whether this is such a world or not.
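That definition can be made concrete by simulating the no-effect world directly (the observed mean and sample size below are made-up):

```python
import random

random.seed(1)

# Suppose we observed a sample mean of 0.5 from n=30 unit-noise measurements.
observed_mean, n = 0.5, 30

# The p-value answers: in a world with NO effect, how often does pure chance
# produce a mean at least this extreme? Simulate that null world directly.
trials = 20_000
extreme = 0
for _ in range(trials):
    null_mean = sum(random.gauss(0, 1) for _ in range(n)) / n
    if abs(null_mean) >= observed_mean:
        extreme += 1

p_sim = extreme / trials
print(f"simulated p ~ {p_sim:.4f}")  # small, but says nothing about which world we're in
```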
tomrod · 7 months ago
This is sort of the basis of econometrics, as well as a driving thought behind causal inference.

Econometrics cares not only about statistical significance but also usefulness/economic usefulness.

Causal inference builds on base statistics and ML, but its strength lies in how it uses design and assumptions to isolate causality. Tools like sensitivity analysis, robustness checks, and falsification tests help assess whether the causal story holds up. My one beef is that these tools still lean heavily on the assumption that the underlying theoretical model is correctly specified. In other words, causal inference helps stress-test assumptions, but it doesn’t always provide a clear way to judge whether one theoretical framework is more valid than another!

taneq · 7 months ago
I’d say rather that “statistical significance” is a measure of surprise. It’s saying: “If this default (the null hypothesis) is true, how surprised would I be to make these observations?”
kqr · 7 months ago
Maybe you can think of it as saying "should I be surprised" but certainly not "how surprised should I be". The magnitude of the p-value is a function of sample size. It is not an odds ratio for updating your beliefs.
prasadjoglekar · 7 months ago
For all the shit that HN gives to MBAs, one thing they instill in you during the Managerial Stats class is that Stat Sig is not the same as Managerial Sig.
nathan_compton · 7 months ago
Really classic "rationalist" style writing: a soup of correct observations about statistical phenomena with chunks of weird political bullshit thrown in here and there. For example: "On a more contemporary note, these theoretical & empirical considerations also throw doubt on concerns about ‘algorithmic bias’ or inferences drawing on ‘protected classes’: not drawing on them may not be desirable, possible, or even meaningful."

This is such a bizarre sentence. The way it's tossed in, not explained in any way, not supported by references, etc. I guess the implication being made is something like: "because there is a hidden latent variable that determines criminality and we can never escape from correlations with it, it's ok to use 'is_black' in our black box model which decides if someone is going to get parole"? Ridiculous. Does this really "throw doubt" on whether we should care about this?

The concerns about how models work are deeper than the statistical challenges of creating or interpreting them. For one thing, all the degrees of freedom we include in our model selection process allow us to construct models which do anything that we want. If we see a parole model which includes "likes_hiphop" as an explanatory variable we ought to ask ourselves who decided that should be there and whether there was an agenda at play beyond "producing the best model possible."

These concerns about everything being correlated actually warrant much more careful understanding of the political ramifications of how and what we choose to model, and based on which variables, because they tell us that in almost any non-trivial case a model is, at least partly, necessarily a political object, almost certainly consciously or subconsciously decorated with some conception of how the world is or ought to be explained.

zahlman · 7 months ago
> This is such a bizarre sentence. The way it's tossed in, not explained in any way,

It reads naturally in context and is explained by the foregoing text. For example, the phrase "these theoretical & empirical considerations" refers to theoretical and empirical considerations described above. The basic idea is that, because everything correlates with everything else, you can't just look at correlations and infer that they're more than incidental. The political implications are not at all "weird", and follow naturally. The author observes that social scientists build complex models and observe huge amounts of variables, which allows them to find correlations that support their hypothesis; but these correlations, exactly because they can be found everywhere, are not anywhere near as solid evidence as they are presented as being.

> Like I guess the implication being made is something like "because there is a hidden latent variable that determines criminality and we can never escape from correlations with it, its ok to use "is_black" in our black box model which decides if someone is going to get parole?

No, not at all. The implication is that we cannot conclude that the black box model actually has an "is_black" variable, even if it is observed to have disparate impact on black people.

nathan_compton · 7 months ago
Sorry, but I don't think that is a reasonable read. The phrase "not drawing on them may not be desirable, possible, or even meaningful" is a political statement, except perhaps for "possible", which is just a flat statement that it's hard to separate causal variables from non-causal ones.

Nothing in the statistical observation that variables tend to be correlated suggests we should somehow reject the moral perspective that it's desirable for a model to be based on causal rather than merely correlated variables, even if finding such variables is difficult, or even impossible, to do perfectly. And it's certainly also _meaningful_ to do so, even if there are statistical challenges. A model based on "socioeconomic status" has a totally different social meaning than one based on race, even if we cannot fully disentangle the two statistically. He is mixing up statistical and social, moral, and even philosophical questions in a way which is, in my opinion, misleading.

pcrh · 7 months ago
"Rationalists" do seem to have a fetish for ranking people and groups of people. Oddly enough, they frequently use poorly performed studies and under-powered data to reach their conclusions about genetics and IQ especially.
nxobject · 7 months ago
> For example: "On a more contemporary note, these theoretical & empirical considerations also throw doubt on concerns about ‘algorithmic bias’ or inferences drawing on ‘protected classes’: not drawing on them may not be desirable, possible, or even meaningful."

As much as I do think that good, parsimonious social science modeling _requires_ theoretical commitments, the test is whether TFA would say the same thing about a political cause du jour - say, `is_white` in hiring at an organization that does outreach to minority communities.

ml-anon · 7 months ago
Yes this is gwern to a "T". Overwhelm with a r/iamverysmart screed whilst insidiously inserting baseless speculation and opinion as fact as if the references provided cover those too. Weirdly the scaling/AI community loves him.
senko · 7 months ago
The article missed the chance to include the quote from that standard compendium of information and wisdom, The Hitchhiker's Guide to the Galaxy:

> Since every piece of matter in the Universe is in some way affected by every other piece of matter in the Universe, it is in theory possible to extrapolate the whole of creation — every sun, every planet, their orbits, their composition and their economic and social history from, say, one small piece of fairy cake.

sayamqazi · 7 months ago
Wouldn't you need the T_zero configuration of the universe for this to work?

Given different T_zero configurations of matter and energy, T_current would be different. And there are many pathways that could lead to the same physical configuration (positions + energies etc.) with different (Universe minus cake) configurations.

Also, we are assuming there are no non-deterministic processes happening at all.

senko · 7 months ago
I am assuming integrating over all possible configurations would be a component of The Total Perspective Vortex.

After all, Feynman showed this is in principle possible, even with local nondeterminism.

(this being a text medium with a high probability of another commenter misunderstanding my intent, I must end this with a note that I am, of course, BSing :)

eru · 7 months ago
> Wouldn't you need the T_zero configuration of the universe for this to work?

Why? We learn about the past by looking at the present all the time. We also learn about the future by looking at the present.

> Also, we are assuming there are no non-deterministic processes happening at all.

Depends on the kind of non-determinism. If there's randomness, you 'just' deal with probability distributions instead. Since you have measurement error anyway, you need to do that anyway.

There are other forms of non-determinism, of course.

jerf · 7 months ago
The real problem is you need a real-number-valued universe for this to work, where the measurer needs access to the full real values [1]. In our universe, which has a Planck size and Planck time and related limits, the statement is simply untrue. Even if you knew every last detail about a piece of fairy cake, whatever "every last detail" may actually be, and even if the universe is for some reason deterministic, you still could not derive the entire rest of the universe from it correctly. Some sort of perfect intelligence with access to massive amounts of computation may be able to derive a great deal more than you realize, especially about the environment in the vicinity of the cake, but it couldn't derive the entire universe.

[1]: Arguments are ongoing about whether the universe has "real" numbers (in the mathematical sense) or not. However it is undeniable the Planck constants still provide a practical barrier to any hypothetical real valued numbers in the universe that make them in practice inaccessible.

prox · 7 months ago
In Buddhism we have dependent origination : https://en.wikipedia.org/wiki/Prat%C4%ABtyasamutp%C4%81da
lioeters · 7 months ago
Also the concept of implicate order, proposed by the theoretical physicist David Bohm.

> Bohm employed the hologram as a means of characterising implicate order, noting that each region of a photographic plate in which a hologram is observable contains within it the whole three-dimensional image, which can be viewed from a range of perspectives.

> That is, each region contains a whole and undivided image.

> "There is the germ of a new notion of order here. This order is not to be understood solely in terms of a regular arrangement of objects (e.g., in rows) or as a regular arrangement of events (e.g., in a series). Rather, a total order is contained, in some implicit sense, in each region of space and time."

> "Now, the word 'implicit' is based on the verb 'to implicate'. This means 'to fold inward' ... so we may be led to explore the notion that in some sense each region contains a total structure 'enfolded' within it."

euroderf · 7 months ago
Particles do not suffer from predestination, do they ?
apples_oranges · 7 months ago
People didn't always use statistics to discover truths about the world.

This, once developed, just happened to be a useful method. But given the abuse of those methods, and the proliferation of stupidity disguised as intelligence, it's always fitting to question them, this time with this correlation-noise observation.

Logic and fundamental knowledge about domains: you need those first. Just counting things without understanding them in at least one or two other ways is a tempting invitation for misleading conclusions.

kqr · 7 months ago
> People didn't always use statistics to discover truths about the world.

And they were much, much worse off for it. Logic does not let you learn anything new. All logic allows you to do is restate what you already know. Fundamental knowledge comes from experience or experiments, which need to be interpreted through a statistical lens because observations are never perfect.

Before statistics, our alternatives for understanding the world were (a) rich people sitting down and thinking deeply about how things could be, (b) charismatic people standing up and giving sermons on how they would like things to be, or (c) clever people guessing things right every now and then.

With statistics, we have to a large degree mechanised the process of learning how the world works, and anyone sensible can participate, and they can know with reasonable certainty whether they are right or wrong. It was impossible to prove a philosopher or a clergyman wrong!

That said, I think I agree with your overall point. One of the strengths of statistical reasoning is what's sometimes called intercomparison, the fact that we can draw conclusions from differences between processes without understanding anything about those processes. This is also a weakness because it makes it easy to accidentally or intentionally manipulate results.

aeonik · 7 months ago
Discovering that two different-seeming statements reduce to the same truth is new knowledge.
mnky9800n · 7 months ago
There is a quote from George Lucas where he talks about how, when new things come into a society, people tend to overdo it.

https://www.youtube.com/watch?v=VEIrQUXm_hY

arduanika · 7 months ago
Quoted repeatedly by Mr. Plinkett in his new drop yesterday.

https://www.youtube.com/watch?v=0xeMak4RqJA

apples_oranges · 7 months ago
Nice, yeah. With many movies one has to ask: What's the point? Especially all Disney Star Wars..
ricardobayes · 7 months ago
Not commenting on the topic at hand, but my goodness, what a beautiful blog. The drop cap, the inline comments on the right-hand side that appear on larger screens, the progress bar: chef's kiss. This is what a love project looks like.
scarmig · 7 months ago
You may be interested in gwern's dropcap article:

https://gwern.net/dropcap

Evidlo · 7 months ago
This is such a massive article. I wish I had the ability to grind out treatises like that. Looking at other content on the guy's website, he must be like a machine.
kqr · 7 months ago
IIRC Gwern lives extremely frugally somewhere remote and is thus able to spend a lot of time on private research.
tux3 · 7 months ago
IIRC people funded moving gwern to the bay not too long ago.
lazyasciiart · 7 months ago
That and early bitcoin adoption. There’s a short bio somewhere on the site.
pas · 7 months ago
lots of time, many iterations, affinity for the hard questions, some expertise in research (and Haskell). oh, and also it helps if someone is funding your little endeavor :)
aswegs8 · 7 months ago
I wish I were even able to read things like that.
tmulc18 · 7 months ago
gwern is goated
derbOac · 7 months ago
Arguments like this have been around for decades. I think it's important to keep in mind — critical even.

At the same time, as I've been forced to wrestle with it more in my work, I've increasingly felt that it's sort of empty and unhelpful. "Crud" does happen in patterns, like a kind of statistical cosmic background radiation; it's not meaningless. Sometimes it's important to understand it, and dismissing it as noise gets no one anywhere. Sometimes the associations are difficult to explain when you try to pick them apart, and other times I think they're key to understanding uncontrolled confounds that should be controlled for.

As much as this background association is present too, it's not always there. Sometimes things do have zero association.

Also, trying to come up with a "meaningful" effect size that's not zero is pretty arbitrary and subjective.

There's probably more productive ways of framing the phenomenon.

dang · 7 months ago
Correlated. Others?

Everything Is Correlated - https://news.ycombinator.com/item?id=19797844 - May 2019 (53 comments)

stouset · 7 months ago
Correlated, you mean?
dang · 7 months ago
That's so good I had to nick it for the parent comment. Thanks!

(It said "Related" before, of course: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que....)

pnt12 · 7 months ago
Those would be all articles posted in HN :)