Correlation coefficient

Definition

Given two random variables X and Y, Pearson’s correlation coefficient ρ is defined as the ratio between the covariance of the two variables and the product of their standard deviations:

\rho(X,Y) = \displaystyle{\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y}}=\frac{E\left[ (X-\mu_X)(Y-\mu_Y) \right]}{\sigma_X \sigma_Y},

where \mu_X, \;\mu_Y,\; \sigma_X,\; \text{and}\; \sigma_Y are the means and standard deviations of X and Y. The correlation coefficient is commonly used as a measure of the strength of the linear relation between the two variables, and it always lies between -1 and 1. Substituting estimates of the covariance and standard deviations computed from N paired samples (x_i, y_i), such as two time series, gives the sample correlation coefficient, commonly denoted as r:

r = \displaystyle{\frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}}}

where

\bar{x} = \displaystyle{\frac{1}{N}\sum_{i=1}^N x_i} \; \; \text{and}\;\;\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i.

Equivalently, r can be written in terms of raw sums as

r = \displaystyle{\frac{N\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{N \sum x_i^2 -\big(\sum x_i\big)^2} \sqrt{N \sum y_i^2 - \big(\sum y_i\big)^2}}}
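The two expressions for r can be checked against each other numerically. The following is a minimal sketch in plain Python; the function names `pearson_r` and `pearson_r_raw_sums` and the sample data are illustrative choices, not part of the original text.

```python
import math

def pearson_r(x, y):
    """Sample correlation computed from deviations about the means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def pearson_r_raw_sums(x, y):
    """Same quantity, computed from the raw sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Illustrative data (made up for this example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(pearson_r(x, y), pearson_r_raw_sums(x, y))  # both are approximately 0.8
```

The raw-sum form needs only a single pass over the data, but it subtracts two potentially large numbers, so it can lose precision when the means are large compared with the spread of the data. The deviation form is usually the safer choice in floating point.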

Testing for significance

To test for the significance of the estimated correlation coefficient against the null hypothesis that the true correlation is equal to 0, one can compute the statistic

t = \displaystyle{r\sqrt{\frac{N-2}{1-r^2}}},

which follows a Student’s t-distribution with N-2 degrees of freedom under the null hypothesis (zero correlation). In Matlab, for instance, the p-value can be computed with the function tcdf(): p = 2*tcdf(-abs(t), N-2) gives the p-value of a two-tailed t-test for a given t.
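The same two-tailed test can be sketched in Python without external libraries. Since the standard library has no t-distribution CDF, the sketch below integrates the t density numerically with Simpson's rule as a stand-in for a library routine such as Matlab's tcdf(). The names `t_pdf`, `two_tailed_p` and `correlation_t_test` are hypothetical, and the step size is an assumption that is adequate for moderate values of |t|, not a tuned choice.

```python
import math

def t_pdf(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

def two_tailed_p(t, df):
    """P(|T| > |t|), using composite Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    # Even number of sub-intervals, with a step size of at most about 0.005.
    steps = 2 * max(500, math.ceil(100 * upper))
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = total * h / 3  # approximates P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)

def correlation_t_test(r, n):
    """t statistic and two-tailed p-value for H0: zero correlation (needs |r| < 1)."""
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))
    return t, two_tailed_p(t, df)

t_stat, p_value = correlation_t_test(0.5, 4)
print(t_stat, p_value)  # p is approximately 0.5 for r = 0.5 and N = 4
```

For very small p-values the subtraction from 1 limits the attainable precision, so a dedicated t CDF from a statistics library is preferable in practice.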

Alternatively, one can transform the correlation coefficient using the Fisher transformation, given by

F(r) = 0.5 \log \displaystyle{\frac{1 + r}{1 - r}}.

Here log denotes the natural logarithm. F(r) approximately follows a normal distribution with mean F(r_0), where r_0 is the true correlation, and standard deviation \frac{1}{\sqrt{N-3}}. With this, a z-score can be defined as

z = \displaystyle{\frac{F(r) - F(r_0)}{\sqrt{\frac{1}{N-3}}}}.

Under the null hypothesis that the true correlation equals r_0, and assuming that the sample pairs are independent, identically distributed, and drawn from a bivariate normal distribution, z approximately follows a standard normal distribution. Thus an approximate p-value can be obtained from a standard normal probability table.
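As a minimal sketch of this test, the Fisher transformation equals the inverse hyperbolic tangent, so `math.atanh` can be used directly, and the two-sided normal p-value follows from the complementary error function. The function name `fisher_z_test` and the example numbers are illustrative assumptions.

```python
import math

def fisher_z_test(r, r0, n):
    """z-score and two-sided p-value for H0: true correlation equals r0."""
    if n <= 3:
        raise ValueError("the Fisher z-test needs more than 3 samples")
    # F(r) = 0.5 * log((1 + r) / (1 - r)) = atanh(r)
    z = (math.atanh(r) - math.atanh(r0)) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Illustrative numbers: r = 0.6 from N = 28 samples, tested against r0 = 0.
z, p = fisher_z_test(0.6, 0.0, 28)
print(z, p)
```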

One can also use the Fisher transformation to test whether two correlations r_1 and r_2 are significantly different by computing the z-score using the formula:

 z = \displaystyle{\frac{F(r_1) - F(r_2)}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}}},

which is approximately distributed as N(0,1) when the null hypothesis H_0, that the two true correlations are equal, holds. Here N_1 and N_2 are the numbers of samples used to compute r_1 and r_2, respectively, and the two samples are assumed to be independent of each other.
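The comparison of two correlations can be sketched in the same way. The function name `compare_correlations` and the example values are hypothetical; the p-value is two-sided and relies on the normal approximation described above.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """z-score and two-sided p-value for H0: the two true correlations are equal."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Illustrative numbers: two independent samples of 53 pairs each.
z12, p12 = compare_correlations(0.6, 53, 0.0, 53)
print(z12, p12)
```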

Incremental algorithm to compute the correlation coefficient

Define S_{XY},\; S_{XX},\; \text{and}\; S_{YY} as follows: S_{XY} = \sum_{i=1}^{N} ( x_i - \bar{x} )(y_i - \bar{y}),  S_{XX} = \sum_{i=1}^{N} (x_i - \bar{x})^2, and S_{YY} = \sum_{i=1}^{N} (y_i - \bar{y})^2.  The correlation coefficient can now be written as

r = \displaystyle{\frac{S_{XY}}{\sqrt{S_{XX} S_{YY} }}}.

To estimate the correlation coefficient incrementally, use the following algorithm:

Initialize n = 1:

\bar{x} = x_1, \; \bar{y} = y_1

S_{XY} = 0, \; S_{XX} = 0, \; S_{YY} = 0

for n = 2 to N, compute

\delta = (n-1)/n

\Delta_X = x_n - \bar{x}

\Delta_Y = y_n - \bar{y}

S_{XX} = S_{XX} + \delta \Delta_X \Delta_X

S_{YY} = S_{YY} + \delta \Delta_Y \Delta_Y

S_{XY} = S_{XY} + \delta \Delta_X \Delta_Y

\bar{x} = \bar{x} + \Delta_X / n

\bar{y} = \bar{y} + \Delta_Y / n

Re-compute r using the above equation

end
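The algorithm above can be sketched as a small Python class that keeps the running means and sums and returns r on demand. The class name `RunningCorrelation` and its attribute names are illustrative. Starting from n = 0 with all quantities at zero reproduces the initialization step, because the factor (n-1)/n vanishes for the first sample.

```python
import math

class RunningCorrelation:
    """Incrementally updated sample correlation coefficient."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.s_xx = 0.0
        self.s_yy = 0.0
        self.s_xy = 0.0

    def update(self, x, y):
        """Fold one new pair (x, y) into the running statistics."""
        self.n += 1
        n = self.n
        delta = (n - 1) / n
        dx = x - self.mean_x  # deviation from the previous mean of x
        dy = y - self.mean_y  # deviation from the previous mean of y
        self.s_xx += delta * dx * dx
        self.s_yy += delta * dy * dy
        self.s_xy += delta * dx * dy
        self.mean_x += dx / n
        self.mean_y += dy / n

    @property
    def r(self):
        """Current estimate of r; NaN while it is still undefined."""
        denominator = math.sqrt(self.s_xx * self.s_yy)
        return self.s_xy / denominator if denominator > 0 else float("nan")

# Illustrative data (made up for this example).
rc = RunningCorrelation()
for x_n, y_n in zip([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]):
    rc.update(x_n, y_n)
print(rc.r)  # approximately 0.8, matching the batch formula
```

Returning NaN when S_{XX} or S_{YY} is zero is a design choice of this sketch: r is undefined for a single sample or for constant data, and the original algorithm does not say how to handle that case.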
