- it's always fun when a new equation or formula is discovered, doubly so when it is very practical
- actually really easy to wrap your head around
- seems very computationally efficient. Basically boils down to sorts and distances
- not limited to strictly numeric fields (integers and reals). Anything with an ordering defined can act as your Xs/Ys: characters, words, complex numbers/vectors (under magnitude). I think you could even apply it recursively to divide and conquer high-dimensional datasets.
- looks useful in both LTI and non-stationary signal analysis
- possible use cases: exoplanet search, cosmology in general, ecology, climate, cryptanalysis, medicine, neuroscience, and of course, someone will find a way to win (more likely lose) money on stonks.
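Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(N log N).
Thus, it is not computationally efficient.
To break ties randomly one has to have a good random number generator, which is a problem in itself.
Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(N log²(N)).
But it is a cool idea nevertheless; it brings up the use of ranking in statistics, and there are ties to other areas of statistical science.
For example, it appears that you can more efficiently prune language models if you use a rank metric rather than probability [1].
[1] https://aclanthology.org/P02-1023.pdf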
> Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That is not really what I meant. First, I personally consider N log N the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than N log N is, like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.
Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there are almost always ways of sorting data that are really fast in practice.
I didn't know about sorting networks, that is fascinating!
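We can track the rank of newly sampled pairs in O(log N) using something like an order statistic tree:
https://en.wikipedia.org/wiki/Order_statistic_tree
But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.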
Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:
(1) Rank coefficients can absolutely be updated online.
(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.
(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If in your initial sort of the X values you instead lexicographically sort by X then Y, you can get away with a linear algorithm for computing those pairwise absolute differences.
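For what it's worth, a tiny sketch of the linear pass described in (3): given the Y values of one run of tied X values, already in ascending order thanks to the lexicographic sort, the sum of all pairwise absolute differences falls out of a single loop (the function name is made up; this is just that sub-computation, not the full estimator):

    def sum_pairwise_abs_diffs(ys_ascending):
        # In ascending order, ys_ascending[j] is the larger element of j pairs and the
        # smaller of (k - 1 - j) pairs, so it contributes y * (2*j - (k - 1)) in total.
        k = len(ys_ascending)
        return sum(y * (2 * j - (k - 1)) for j, y in enumerate(ys_ascending))

    assert sum_pairwise_abs_diffs([1, 2, 4]) == (2 - 1) + (4 - 1) + (4 - 2)  # = 6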
Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(nlogn) is fine for a very wide variety of practical uses.
> Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be roughly 9 times slower, but throwing 9 times more computational power at a problem is far from unheard of nowadays.
Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.
You don't have to use sorting networks; to differentiate (or get a subgradient, anyway) a sorted list you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.
Also, I would say that random number generation is a well-solved problem.
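Could you expand a little bit more on how to differentiate this coefficient? Since it is only based on ranks, it feels very non-differentiable.
https://twitter.com/adad8m/status/1474754752193830912?s=21
As a rough illustration of the "pass the derivative back to the original index" idea a couple of comments up, here is a minimal numpy sketch (the function names and the toy objective are mine, not from the thread; the sort is treated as a fixed permutation):

    import numpy as np

    def sort_forward(y):
        """Sort y, remembering the permutation so gradients can be routed back."""
        perm = np.argsort(y, kind="stable")
        return y[perm], perm

    def sort_backward(grad_sorted, perm):
        """Scatter the gradient w.r.t. the sorted values back to the original indices."""
        grad_y = np.empty_like(grad_sorted)
        grad_y[perm] = grad_sorted
        return grad_y

    # Toy check: d/dy of sum(3 smallest entries) is 1 at those entries, 0 elsewhere.
    y = np.array([4.0, 1.0, 3.0, 5.0, 2.0])
    y_sorted, perm = sort_forward(y)
    g_sorted = np.zeros_like(y_sorted)
    g_sorted[:3] = 1.0                       # upstream gradient of sum(y_sorted[:3])
    print(sort_backward(g_sorted, perm))     # [0. 1. 1. 0. 1.]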
Since it's rank-based, it would be nice to see a comparison to Spearman's correlation instead. It's very easy to find failure cases for Pearson, so the visualization is next to meaningless, IMO.
This new coefficient of correlation is really really awesome, and this visualization shows its value in such a beautifully simple presentation.
It would be great if someone who has Wikipedia edit privileges could edit the Wikipedia article at [1] to describe/link how Chatterjee's correlation coefficient solves many of the known limitations of Pearson's correlation coefficient.
;)
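[1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
(especially second top-left diagram)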
Easy there. New correlation coefficients get proposed all the time (e.g. the introduction of the linked paper lists ~10-20 alone!). It's not a good idea to add every newly proposed coefficient to established wiki pages just because they trend on social media. Yes, the paper looks nice, but if you read any new paper proposing a new measure, they all do! They're meant to be written that way. Let the community decide and test and discuss, and if in 10 years this new coefficient is well accepted and has proven itself, we can think about your proposed edit. Doing it before is putting the cart before the horse, and is a recipe for astroturfing.
But the part that one would add would not necessarily be the definition of the coefficient ξn, but rather the interesting discussion at the beginning about what makes for a good correlation coefficient.
I once attended a summer school in Saint-Flour, France, where Sourav Chatterjee gave a masterful exposition of results on large deviations for random graphs. All chalk talk, clear presentation. My impression is he's a deep thinker. On that basis alone I give such an article much more weight than the plethora of articles that pass before the eyes; reading the masters has always been a good rule of thumb.
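https://doi.org/10.1080/01621459.2020.1758115
https://arxiv.org/abs/1910.12327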
This follow-up paper presents a related measure of conditional dependence that has a "natural interpretation as a nonlinear generalization of the familiar partial R2 statistic for measuring conditional dependence by regression."
The follow-up paper also provides some additional interpretive insights, I think.
My intuitive impression is that both of these are important developments.
I also have a very vague suspicion, based on the form of the function, that the correlation measure has some interpretation in terms of mutual information involving rank transformations of random variables.
Thanks for finding this article. I agree, these are important developments in particular because so many econometric models are now using machine learning without any distributional assumptions. Using correlation coefficients based on linearity is grossly insufficient.
I didn't get the idea of ranking. But it's simple:
"In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. For example, the ordinal data hot, cold, warm would be replaced by 3, 1, 2."
Also learned that the Spearman coefficient is just the Pearson coefficient taken on the ranks of the data, instead of on the raw data.
But whereas Pearson/Spearman takes a normalized sum of products of deviations from the mean (Σ(x-x̄)(y-ȳ)/(n·σxσy), where x̄ is the mean and σ = population std. dev.), Chatterjee takes a normalized sum of absolute rank differences (ξ = 1 - 3Σ|rᵢ₊₁-rᵢ|/(n²-1)), concerning just the ranks of the Y data after the (X,Y) pairs have been sorted by X.
But still missing the intuition for why the sum of difference of ranks is so useful or where the magic numbers come from.
jmount has an explanation elsewhere in this thread linking to https://win-vector.com/2021/12/26/how-to-read-sourav-chatter... which does a great job of explaining the intuition, but in a nutshell the normalization factor of 3 comes from the fact that if you select two random points between 0 and 1, the mean distance between them will be 1/3 (which is pretty easy to write down and solve; it boils down to the fact that a pyramid of height 1 that's 1x1 at the base has volume 1/3).
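To make the formulas above concrete, here is a rough Python sketch (the names are mine; it assumes no ties in X or Y, which is the case the 3/(n² - 1) normalization covers) that computes the coefficient, checks that Spearman is just Pearson on ranks, and Monte-Carlo-checks the 1/3 fact:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, rankdata

    def xi_no_ties(x, y):
        """Chatterjee's coefficient, no-ties form: 1 - 3*sum|r_{i+1} - r_i| / (n^2 - 1),
        where r_i is the rank of y_i after the (x, y) pairs are sorted by x."""
        x, y = np.asarray(x), np.asarray(y)
        n = len(x)
        r = rankdata(y[np.argsort(x)])       # ranks of Y in X-sorted order
        return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n * n - 1)

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, 2000)

    print(xi_no_ties(x, x**2))                       # noiseless Y = f(X): close to 1
    print(xi_no_ties(x, rng.uniform(size=x.size)))   # independent: close to 0

    # Spearman's rho is just Pearson's r computed on the ranks.
    y = x**3 + rng.normal(size=x.size)
    print(spearmanr(x, y)[0], pearsonr(rankdata(x), rankdata(y))[0])

    # The 1/3 behind the normalization: mean distance between two uniform points on [0, 1].
    print(np.mean(np.abs(rng.uniform(size=10**6) - rng.uniform(size=10**6))))  # ~0.333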
I have a very naive question: What are the downsides of estimating mutual information instead?
I have a math (but not statistics) background, and am sometimes bewildered by the many correlation coefficients that float around when MI describes pretty much exactly what one wants ("how much does knowledge of one variable tell you about the other variable"?).
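So ... what am I not understanding?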
Different coefficients help you look at different kinds of relationships. For example, Pearson's R tells you about linear relationships between variables -- it's closely tied to the covariance: "how useful is it to draw a line through these data points, and how accurate is interpolating likely to be?".
Spearman's correlation helps you understand monotonic/rank-order relationships between variables: "Is there a trend where increasing X tends to also increase Y?" (This way we can be just as good at measuring the existence of linear, logarithmic, or exponential relationships, although we can't tell them apart.)
Mutual information helps you understand how similar two collections of data are, in the sort of unstructured way that's useful in building decision trees. You could have high mutual information without any sort of linear or monotonic relationship at all. Thus it's more general while at the same time not telling you anything that would be helpful in building, for instance, a predictive multivariate linear model.
TL;DR: More specific coefficients leverage assumptions about the structure of the data (e.g. linearity), which can help you construct optimal versions of models under those assumptions. Mutual information doesn't make any assumptions about the structure of the data so it won't feed into such a model, but it still has lots of applications!
MI is quite useful and widely used. It typically requires binning data though when distributions are unknown / empirically estimated. This approach is a rank-based score, more similar to Spearman correlation than Pearson. This allows for nonlinear relationships between the two variables.
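For comparison, the plug-in estimate people usually reach for looks roughly like this (a hedged sketch; the bin count is an arbitrary choice, which is exactly the tricky part):

    import numpy as np

    def mutual_information_binned(x, y, bins=16):
        """Plug-in MI estimate (in nats) from a 2-D histogram of the sample."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
        py = pxy.sum(axis=0, keepdims=True)   # marginal of y (row vector)
        nz = pxy > 0                          # skip empty bins to avoid log(0)
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5000)
    print(mutual_information_binned(x, x + 0.1 * rng.normal(size=x.size)))  # large
    print(mutual_information_binned(x, rng.normal(size=x.size)))            # near 0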
A slightly critical review of the work can be seen here: https://academic.oup.com/biomet/advance-article/doi/10.1093/.... They argue that the older forms of rank correlation, namely D, R, and tau*, are superior. Nonetheless, it seems like a nice contribution to the stats literature, although I doubt the widespread use of correlation is going anywhere.
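https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.85...
https://en.wikipedia.org/wiki/Transfer_entropy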
It is designed for time series, but can be adapted to a more general case. The advantage over MI is that with MI you could only see that A and B are related, while TE can show that A drives B, but not the reverse.
In the article it is explained that the purpose of this coefficient is to estimate how much Y is a function of X [1] (or how noisy this association is); in particular, this coefficient is 1 iff Y is a function of X.
With MI (the article claims that) you can have a coefficient of 1 without Y being a function of X.
[1] this means that this coefficient is intentionally not symmetric.
I'm also interested in this, having tried and semi-successfully used mutual information for finding associations between multinomial variables. As an even more naive question, I find the actual selection of estimators bewildering. How do I know which estimator to use for mutual information? How do I know if my chosen estimator has converged or is doing a bad job on my data? Bringing it back to the topic at hand, does the estimator presented in the paper provide good estimates for a wider variety of cases than the mutual information plug-in estimator? If so I can see it might be nice for simplicity reasons alone. Can we have different estimators for this new correlation coefficient? Any ideas what that would look like?
Mutual information is not trivial or even possible to estimate in many practical situations, as far as I know. Example applications in robotics and computer vision where mutual information would be useful include segmentation and denoising of unordered 3D point data.
Yes, as someone mentioned above, the problem is getting the underlying distribution of the data, so you can measure -Σ pᵢ log(pᵢ); this usually involves some binning, which can be tricky (and yes, I know the formula I gave is entropy, not MI).
I try to remind myself that "it is just a model", as a corollary to "all models are wrong, some are useful." You are never dealing with the real world. And you are usually trying to estimate some future as of yet unobserved signal based on existing data. In other words, if your bins are reasonable and reasonably usefully accurate, you can build a working if not perfect system.
Don't try to optimize testing error performance to a value lower than the irreducible error in the system.
How is it possible to make a general coefficient of correlation that works for any non-linear relationship? Say y = sha256(x): doesn't that mean y is a predictable function of x, but it's statistically impossible to tell from looking at inputs/outputs alone?
Two things:
1. The result is asymptotic, i.e. it holds as the number of samples approaches infinity.
2. The result is an "almost surely" result, i.e. in the collection of all possible infinite samples, the set of samples for which it fails has measure 0. In non-technical terms this means that it works for typical random samples and may not work for handpicked counterexamples.
In our particular case let f = sha256. Then X must be discrete, i.e. a natural number. Now the particulars depend on the distribution of X, but the general idea is that since we have discrete values, the probability that we get an infinite sample where the values tend to infinity is 0. So we get that in a typical sample there's going to be an infinitude of x ties, and furthermore most x values aren't too large (in a way you can make precise), so the tie factors l_i dominate, since there just aren't that many distinct values encountered in total. And so we get that the coefficient tends to 1.
No. If you have a good hash function, that means it's computationally infeasible to determine anything about x based only on y. It's not statistically impossible at all; "statistically" doesn't concern itself with computational difficulties.
This is similar to how, e.g., we generally assume that AES is unbreakable from a computational point of view, but if you want a statistically unbreakable cipher, your only (IINM) option is a one-time pad.
The summary says that it works if it is a measurable function [0], which is structure preserving. So sha256 would scramble the input too much for it to be detected here.
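[0] https://en.m.wikipedia.org/wiki/Measurable_function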
I've not yet read the article, just the abstract. But the abstract is pretty precise, and being "measurable" is a very weak constraint. For (computationally) complex functions like a hash, my first guess is the "escape clause" is in the number of samples needed for the statistic to converge to its expected value.
Simple: if in the whole sample, the same x always comes with the same y, then y is a function of x.
Example:
x1 = 1, y1 = 2
x2 = 1.1, y2 = 2
Here, y is a function of x, because if we know that x = 1, then we also know that y = 2. However, x is not a function of y, as we don't know what value x has given that y = 2.
I hope that made it more clear. Here, "function" simply means that every time we started with the same x, we also got the same y.
Seems that by this logic, if you don't have any repeats in your x values, then you are bound to conclude that any set of y values is a function of x.
In Theorem 1.1, f is a function of random variables, which might be where you're confused.
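I think the abstract is a bit strong; it probably means "converges to" for a large repeating sample.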
> doesn't that mean y is a predictable function of x
Sort of: as a function of real numbers, sha256 is just some deterministic function.
But the point is that its output "looks like" a uniform random variable for any reasonable input distribution, i.e. as a function of random variables, the input and output variables should have 0 correlation.
That's right, the paper has practical estimates of convergence only for continuous functions (kind of). For hash functions this coefficient is useless.
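A quick empirical check of that point (repeating the no-ties estimator from the sketch further up so this snippet runs on its own; the digest is truncated only so the values stay exactly representable as floats): y below is exactly a deterministic function of x, yet at any sample size you can actually run, the estimate just sits near 0.

    import hashlib
    import numpy as np
    from scipy.stats import rankdata

    def xi_no_ties(x, y):
        r = rankdata(np.asarray(y)[np.argsort(x)])   # ranks of Y in X-sorted order
        n = len(r)
        return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n * n - 1)

    x = np.arange(10_000)
    # 48 bits of the SHA-256 digest: exactly representable in float64, ties essentially impossible
    y = np.array([int(hashlib.sha256(str(v).encode()).hexdigest()[:12], 16) for v in x], dtype=float)
    print(xi_no_ties(x, y))   # hovers around 0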