- it's always fun when a new equation or formula is discovered, doubly so when it is very practical
- actually really easy to wrap your head around
- seems very computationally efficient. Basically boils down to sorts and distances
- not limited to strictly numeric fields (integers and reals). Anything with an ordering defined can act as your Xs/Ys: characters, words, complex numbers/vectors (under magnitude). I think you could even apply it recursively to divide and conquer high-dimensional datasets.
- looks useful in both LTI and non-stationary signal analysis
- possible use cases: exoplanet search, cosmology in general, ecology, climate, cryptanalysis, medicine, neuroscience, and of course, someone will find a way to win (more likely lose) money on stonks.
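Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(N log N).
Thus, it is not computationally efficient.
To break ties randomly one has to have a good random number generator, which is a problem in itself.
Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(N log²(N)).
But it is a cool idea nevertheless; it brings up the use of ranking in statistics, and there are ties to other areas of statistical science.
For example, it appears that you can more efficiently prune language models if you use a rank metric rather than probability [1].
[1] https://aclanthology.org/P02-1023.pdf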
> Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That is not really what I meant. First, I personally consider N log N the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than N log N is, like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.
Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there are almost always ways of sorting data that are really fast in practice.
I didn't know about sorting networks, that is fascinating!
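We can track the rank of newly sampled pairs in O(log N) using something like an order statistic tree:
https://en.wikipedia.org/wiki/Order_statistic_tree
But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.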
Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:
(1) Rank coefficients can absolutely be updated online.
(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.
(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If in your initial sort of the X values you instead lexicographically sort by X then Y, you can get away with a linear algorithm for computing those pairwise absolute differences.
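For what it's worth, a tiny sketch of the linear pass described in (3): given the Y values of one run of tied X values, already in ascending order thanks to the lexicographic sort, the sum of all pairwise absolute differences falls out of a single loop (the function name is made up; this is just that sub-computation, not the full estimator):

    def sum_pairwise_abs_diffs(ys_ascending):
        # In ascending order, ys_ascending[j] is the larger element of j pairs and the
        # smaller of (k - 1 - j) pairs, so it contributes y * (2*j - (k - 1)) in total.
        k = len(ys_ascending)
        return sum(y * (2 * j - (k - 1)) for j, y in enumerate(ys_ascending))

    assert sum_pairwise_abs_diffs([1, 2, 4]) == (2 - 1) + (4 - 1) + (4 - 2)  # = 6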
Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(nlogn) is fine for a very wide variety of practical uses.
> Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be roughly 9 times slower, but throwing 9 times more computational power at a problem is far from unheard of nowadays.
Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.
You don't have to use sorting networks; to differentiate (or get a subgradient, anyway) a sorted list you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.
Also, I would say that random number generation is a well-solved problem.
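Could you expand a little bit more on how to differentiate this coefficient? Since it is only based on ranks, it feels very non-differentiable.
https://twitter.com/adad8m/status/1474754752193830912?s=21
As a rough illustration of the "pass the derivative back to the original index" idea a couple of comments up, here is a minimal numpy sketch (the function names and the toy objective are mine, not from the thread; the sort is treated as a fixed permutation):

    import numpy as np

    def sort_forward(y):
        """Sort y, remembering the permutation so gradients can be routed back."""
        perm = np.argsort(y, kind="stable")
        return y[perm], perm

    def sort_backward(grad_sorted, perm):
        """Scatter the gradient w.r.t. the sorted values back to the original indices."""
        grad_y = np.empty_like(grad_sorted)
        grad_y[perm] = grad_sorted
        return grad_y

    # Toy check: d/dy of sum(3 smallest entries) is 1 at those entries, 0 elsewhere.
    y = np.array([4.0, 1.0, 3.0, 5.0, 2.0])
    y_sorted, perm = sort_forward(y)
    g_sorted = np.zeros_like(y_sorted)
    g_sorted[:3] = 1.0                       # upstream gradient of sum(y_sorted[:3])
    print(sort_backward(g_sorted, perm))     # [0. 1. 1. 0. 1.]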
Since it's rank-based, it would be nice to see a comparison to Spearman's correlation instead. It's very easy to find failure cases for Pearson, so the visualization is next to meaningless, IMO.
This new coefficient of correlation is really really awesome, and this visualization shows its value in such a beautifully simple presentation.
It would be great if someone who has Wikipedia edit privileges could edit the Wikipedia article at [1] to describe/link how Chatterjee's correlation coefficient solves many of the known limitations of Pearson's correlation coefficient.
;)
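[1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
(especially second top-left diagram)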
Easy there. New correlation coefficients get proposed all the time (e.g. the introduction of the linked paper lists ~10-20 alone!). It's not a good idea to add every newly proposed coefficient to established wiki pages just because they trend on social media. Yes, the paper looks nice, but if you read any new paper proposing a new measure, they all do! They're meant to be written that way. Let the community decide and test and discuss, and if in 10 years this new coefficient is well accepted and has proven itself, we can think about your proposed edit. Doing it before is putting the cart before the horse, and is a recipe for astroturfing.
But the part that one would add would not necessarily be the definition of the coefficient ξn, but rather the interesting discussion at the beginning about what makes for a good correlation coefficient.
I once attended a summer school in Saint-Flour, France, where Sourav Chatterjee gave a masterful exposition of results on large deviations for random graphs. All chalk talk, clear presentation. My impression is he's a deep thinker. On that basis alone I give such an article much more weight than the plethora of articles that pass before the eyes; reading the masters has always been a good rule of thumb.
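https://doi.org/10.1080/01621459.2020.1758115
https://arxiv.org/abs/1910.12327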
This follow-up paper presents a related measure of conditional dependence that has a "natural interpretation as a nonlinear generalization of the familiar partial R2 statistic for measuring conditional dependence by regression."
The follow-up paper also provides some additional interpretive insights, I think.
My intuitive impression is that both of these are important developments.
I also have a very vague suspicion, based on the form of the function, that the correlation measure has some interpretation in terms of mutual information involving rank transformations of random variables.
Thanks for finding this article. I agree, these are important developments in particular because so many econometric models are now using machine learning without any distributional assumptions. Using correlation coefficients based on linearity is grossly insufficient.
I didn't get the idea of ranking. But it's simple:
"In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. For example, the ordinal data hot, cold, warm would be replaced by 3, 1, 2."
Also learned that the Spearman coefficient is just the Pearson coefficient taken on the ranks of the data, instead of on the raw data.
But whereas Pearson/Spearman takes a normalized sum of products of deviations from the mean (Σ(x-x̄)(y-ȳ)/(n·σxσy), where x̄ is the mean and σ = population std. dev.), Chatterjee takes a normalized sum of absolute rank differences (ξ = 1 - 3Σ|rᵢ₊₁-rᵢ|/(n²-1)), concerning just the ranks of the Y data after the (X,Y) pairs have been sorted by X.
But still missing the intuition for why the sum of difference of ranks is so useful or where the magic numbers come from.
jmount has an explanation elsewhere in this thread linking to https://win-vector.com/2021/12/26/how-to-read-sourav-chatter... which does a great job of explaining the intuition, but in a nutshell the normalization factor of 3 comes from the fact that if you select two random points between 0 and 1, the mean distance between them will be 1/3 (which is pretty easy to write down and solve; it boils down to the fact that a pyramid of height 1 that's 1x1 at the base has volume 1/3).
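To make the formulas above concrete, here is a rough Python sketch (the names are mine; it assumes no ties in X or Y, which is the case the 3/(n² - 1) normalization covers) that computes the coefficient, checks that Spearman is just Pearson on ranks, and Monte-Carlo-checks the 1/3 fact:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, rankdata

    def xi_no_ties(x, y):
        """Chatterjee's coefficient, no-ties form: 1 - 3*sum|r_{i+1} - r_i| / (n^2 - 1),
        where r_i is the rank of y_i after the (x, y) pairs are sorted by x."""
        x, y = np.asarray(x), np.asarray(y)
        n = len(x)
        r = rankdata(y[np.argsort(x)])       # ranks of Y in X-sorted order
        return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n * n - 1)

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, 2000)

    print(xi_no_ties(x, x**2))                       # noiseless Y = f(X): close to 1
    print(xi_no_ties(x, rng.uniform(size=x.size)))   # independent: close to 0

    # Spearman's rho is just Pearson's r computed on the ranks.
    y = x**3 + rng.normal(size=x.size)
    print(spearmanr(x, y)[0], pearsonr(rankdata(x), rankdata(y))[0])

    # The 1/3 behind the normalization: mean distance between two uniform points on [0, 1].
    print(np.mean(np.abs(rng.uniform(size=10**6) - rng.uniform(size=10**6))))  # ~0.333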
I have a very naive question: What are the downsides of estimating mutual information instead?
I have a math (but not statistics) background, and am sometimes bewildered by the many correlation coefficients that float around when MI describes pretty much exactly what one wants ("how much does knowledge of one variable tell you about the other variable"?).
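So ... what am I not understanding?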
Different coefficients help you look at different kinds of relationships. For example, Pearson's R tells you about linear relationships between variables -- it's closely tied to the covariance: "how useful is it to draw a line through these data points, and how accurate is interpolating likely to be?".
Spearman's correlation helps you understand monotonic/rank-order relationships between variables: "Is there a trend where increasing X tends to also increase Y?" (This way we can be just as good at measuring the existence of linear, logarithmic, or exponential relationships, although we can't tell them apart.)
Mutual information helps you understand how similar two collections of data are, in the sort of unstructured way that's useful in building decision trees. You could have high mutual information without any sort of linear or monotonic relationship at all. Thus it's more general while at the same time not telling you anything that would be helpful in building, for instance, a predictive multivariate linear model.
TL;DR: More specific coefficients leverage assumptions about the structure of the data (e.g. linearity), which can help you construct optimal versions of models under those assumptions. Mutual information doesn't make any assumptions about the structure of the data so it won't feed into such a model, but it still has lots of applications!
MI is quite useful and widely used. It typically requires binning data though when distributions are unknown / empirically estimated. This approach is a rank-based score, more similar to Spearman correlation than Pearson. This allows for nonlinear relationships between the two variables.
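For comparison, the plug-in estimate people usually reach for looks roughly like this (a hedged sketch; the bin count is an arbitrary choice, which is exactly the tricky part):

    import numpy as np

    def mutual_information_binned(x, y, bins=16):
        """Plug-in MI estimate (in nats) from a 2-D histogram of the sample."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
        py = pxy.sum(axis=0, keepdims=True)   # marginal of y (row vector)
        nz = pxy > 0                          # skip empty bins to avoid log(0)
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5000)
    print(mutual_information_binned(x, x + 0.1 * rng.normal(size=x.size)))  # large
    print(mutual_information_binned(x, rng.normal(size=x.size)))            # near 0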
A slightly critical review of the work can be seen here: https://academic.oup.com/biomet/advance-article/doi/10.1093/.... They argue that the older forms of rank correlation, namely D, R, and tau*, are superior. Nonetheless, it seems like a nice contribution to the stats literature, although I doubt the widespread use of correlation is going anywhere.
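https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.85...
https://en.wikipedia.org/wiki/Transfer_entropy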
It is designed for time series, but can be adapted to a more general case. The advantage over MI is that with MI you could only see that A and B are related, while TE can show that A drives B, but not the reverse.
In the article it is explained that the purpose of this coefficient is to estimate how much Y is a function of X [1] (or how noisy this association is); in particular, this coefficient is 1 iff Y is a function of X.
With MI (the article claims that) you can have a coefficient of 1 without Y being a function of X.
[1] this means that this coefficient is intentionally not symmetric.
I'm also interested in this, having tried and semi-successfully used mutual information for finding associations between multinomial variables. As an even more naive question, I find the actual selection of estimators bewildering. How do I know which estimator to use for mutual information? How do I know if my chosen estimator has converged or is doing a bad job on my data? Bringing it back to the topic at hand, does the estimator presented in the paper provide good estimates for a wider variety of cases than the mutual information plug-in estimator? If so I can see it might be nice for simplicity reasons alone. Can we have different estimators for this new correlation coefficient? Any ideas what that would look like?
Mutual information is not trivial or even possible to estimate in many practical situations, as far as I know. Example applications in robotics and computer vision where mutual information would be useful include segmentation and denoising of unordered 3D point data.
Yes, as someone mentioned above, the problem is getting the underlying distribution of the data, so you can measure -Σ pᵢ log(pᵢ); this usually involves some binning, which can be tricky (and yes, I know the formula I gave is entropy, not MI).
I try to remind myself that "it is just a model", as a corollary to "all models are wrong, some are useful." You are never dealing with the real world. And you are usually trying to estimate some future as of yet unobserved signal based on existing data. In other words, if your bins are reasonable and reasonably usefully accurate, you can build a working if not perfect system.
Don't try to optimize testing error performance to a value lower than the irreducible error in the system.
How is it possible to make a general coefficient of correlation that works for any non-linear relationship? Say y = sha256(x): doesn't that mean y is a predictable function of x, but it's statistically impossible to tell from looking at inputs/outputs alone?
Two things:
1. The result is asymptotic, i.e. it holds as the number of samples approaches infinity.
2. The result is an "almost surely" result, i.e. in the collection of all possible infinite samples, the set of samples for which it fails has measure 0. In non-technical terms this means that it works for typical random samples and may not work for handpicked counterexamples.
In our particular case let f = sha256. Then X must be discrete, i.e. a natural number. Now the particulars depend on the distribution of X, but the general idea is that since we have discrete values, the probability that we get an infinite sample where the values tend to infinity is 0. So we get that in a typical sample there's going to be an infinitude of x ties, and furthermore most x values aren't too large (in a way you can make precise), so the tie factors l_i dominate, since there just aren't that many distinct values encountered in total. And so we get that the coefficient tends to 1.
No. If you have a good hash function, that means it's computationally infeasible to determine anything about x based only on y. It's not statistically impossible at all; "statistically" doesn't concern itself with computational difficulties.
This is similar to how, e.g., we generally assume that AES is unbreakable from a computational point of view, but if you want a statistically unbreakable cipher, your only (IINM) option is a one-time pad.
The summary says that it works if it is a measurable function [0], which is structure preserving. So sha256 would scramble the input too much for it to be detected here.
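[0] https://en.m.wikipedia.org/wiki/Measurable_function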
I've not yet read the article, just the abstract. But the abstract is pretty precise, and being "measurable" is a very weak constraint. For (computationally) complex functions like a hash, my first guess is the "escape clause" is in the number of samples needed for the statistic to converge to its expected value.
Simple: if in the whole sample, the same x always comes with the same y, then y is a function of x.
Example:
x1 = 1, y1 = 2
x2 = 1.1, y2 = 2
Here, y is a function of x, because if we know that x = 1, then we also know that y = 2. However, x is not a function of y, as we don't know what value x has given that y = 2.
I hope that made it more clear. Here, "function" simply means that every time we started with the same x, we also got the same y.
Seems that by this logic, if you don't have any repeats in your x values, then you are bound to conclude that any set of y values is a function of x.
In Theorem 1.1, f is a function of random variables, which might be where you're confused.
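I think the abstract is a bit strong; it probably means "converges to" for a large repeating sample.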
> doesn't that mean y is a predictable function of x
Sort of: as a function of real numbers, sha256 is just some deterministic function.
But the point is that its output "looks like" a uniform random variable for any reasonable input distribution, i.e. as a function of random variables, the input and output variables should have 0 correlation.
That's right, the paper has practical estimates of convergence only for continuous functions (kind of). For hash functions this coefficient is useless.
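A quick empirical check of that point (repeating the no-ties estimator from the sketch further up so this snippet runs on its own; the digest is truncated only so the values stay exactly representable as floats): y below is exactly a deterministic function of x, yet at any sample size you can actually run, the estimate just sits near 0.

    import hashlib
    import numpy as np
    from scipy.stats import rankdata

    def xi_no_ties(x, y):
        r = rankdata(np.asarray(y)[np.argsort(x)])   # ranks of Y in X-sorted order
        n = len(r)
        return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n * n - 1)

    x = np.arange(10_000)
    # 48 bits of the SHA-256 digest: exactly representable in float64, ties essentially impossible
    y = np.array([int(hashlib.sha256(str(v).encode()).hexdigest()[:12], 16) for v in x], dtype=float)
    print(xi_no_ties(x, y))   # hovers around 0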