Correlation coefficient

by baggy · August 26, 2011

Definition

Given two random variables X and Y, Pearson’s correlation coefficient ρ is defined as the ratio between the covariance of the two variables and the product of their standard deviations:

$\rho(X,Y) = \displaystyle{\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y}}=\frac{E\left[ (X-\mu_X)(Y-\mu_Y) \right]}{\sigma_X \sigma_Y},$

where $\mu_X, \;\mu_Y,\; \sigma_X,\; \text{and}\; \sigma_Y$ are the means and standard deviations of X and Y. The correlation coefficient is commonly used as a measure of the strength of the linear relationship between the two variables. Substituting the sample estimates of the covariance and standard deviations, computed from N paired observations $(x_i, y_i)$ such as two time series, gives the sample correlation coefficient, commonly denoted as r:

$r = \displaystyle{\frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}}}$

where

$\bar{x} = \displaystyle{\frac{1}{N}\sum_{i=1}^N x_i} \; \; \text{and}\;\;\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i.$

Alternatively, r can also be written as

$r = \displaystyle{\frac{N\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{N \sum x_i^2 -\big(\sum x_i\big)^2} \sqrt{N \sum y_i^2 - \big(\sum y_i\big)^2}}}$
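As a quick check that the two expressions for r agree, here is a minimal Python sketch using only the standard library. The function names and the small data set are made up for illustration; they are not part of any library.

```python
import math

def pearson_r_centered(x, y):
    """Sample correlation from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def pearson_r_raw_sums(x, y):
    """Sample correlation from raw sums (the alternative form)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Illustrative data only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_centered = pearson_r_centered(x, y)
r_raw = pearson_r_raw_sums(x, y)
print(r_centered, r_raw)
```

Both functions should return the same value up to rounding. The raw-sums form needs only a single pass over the data, but it can lose precision when the means are large compared with the spread of the values, so the centered form is usually the safer choice.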

Testing for significance

To test for the significance of the estimated correlation coefficient against the null hypothesis that the true correlation is equal to 0, one can compute the statistic

$t = \displaystyle{r\sqrt{\frac{N-2}{1-r^2}}},$

which, under the null hypothesis of zero correlation, follows a Student’s t-distribution with N-2 degrees of freedom, provided the sample pairs are independent and drawn from a bivariate normal distribution. In Matlab, for instance, you can compute the p-value using the function tcdf(). In particular, p = 2*tcdf(-abs(t), N-2) gives the p-value of a two-tailed t-test for a given t.
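For readers not using Matlab, the t statistic itself is easy to compute in plain Python. The sketch below uses a hypothetical helper name, `correlation_t_statistic`, and stops at the statistic because the standard library has no Student's t CDF. The comment about SciPy assumes that package is installed.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing a sample correlation r against zero.

    r: sample correlation coefficient, strictly between -1 and 1
    n: number of sample pairs (at least 3)
    """
    if n < 3:
        raise ValueError("need at least 3 sample pairs")
    if abs(r) >= 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Illustrative values only
t = correlation_t_statistic(0.5, 27)
print(t)

# If SciPy is available (an assumption), the two-tailed p-value mirrors
# the Matlab expression:
#   from scipy import stats
#   p = 2 * stats.t.cdf(-abs(t), 27 - 2)
```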

Alternatively, one can transform the correlation coefficient using the Fisher transformation (with the natural logarithm), given by

$F(r) = 0.5 \log \displaystyle{\frac{1 + r}{1 - r}}.$

$F(r)$ approximately follows a normal distribution with mean $F(r_0)$ and standard deviation $\frac{1}{\sqrt{N-3}}$, where $r_0$ is the true correlation. With this, a z-score can now be defined as

$z = \displaystyle{\frac{F(r) - F(r_0)}{\sqrt{\frac{1}{N-3}}}}.$

Under the null hypothesis that the true correlation equals r₀, and assuming that the sample pairs are independent, identically distributed, and drawn from a bivariate normal distribution, z approximately follows a standard normal distribution. Thus an approximate p-value can be obtained from a normal probability table.
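Instead of a probability table, the approximate p-value can be computed directly. The following is a minimal sketch with the Python standard library; `fisher_z_test` is a hypothetical helper name, and `math.atanh(r)` is used because it equals $0.5 \log \frac{1+r}{1-r}$.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n, r0=0.0):
    """Approximate two-tailed test of a sample correlation r against r0.

    Returns (z, p), where z is the Fisher z-score and p is the
    two-tailed p-value from the standard normal distribution.
    """
    if n <= 3:
        raise ValueError("need more than 3 sample pairs")
    # Fisher transformation F(r) = atanh(r); standard deviation 1/sqrt(n-3)
    z = (math.atanh(r) - math.atanh(r0)) * math.sqrt(n - 3)
    p = 2.0 * NormalDist().cdf(-abs(z))
    return z, p

# Illustrative values only: r = 0.5 from 28 pairs, tested against r0 = 0
z, p = fisher_z_test(0.5, 28)
print(z, p)
```

Setting `r0` to a nonzero value tests against a specific hypothesized correlation, which the t statistic above cannot do.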

One can also use the Fisher transformation to test if two correlations r₁ and r₂ are significantly different by computing the z-score using the formula:

$z = \displaystyle{\frac{F(r_1) - F(r_2)}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}}},$

which is distributed approximately as N(0,1) when the null hypothesis $(H_0: r_1 = r_2)$ is true. Here $N_1$ and $N_2$ are the numbers of samples used to compute $r_1$ and $r_2$, respectively.
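The comparison of two correlations can be sketched the same way. This assumes, as the formula does, that the two correlations come from independent samples; `compare_correlations` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """Approximate two-tailed test of H0: the two correlations are equal.

    r1, r2: sample correlations from two independent samples
    n1, n2: number of sample pairs behind r1 and r2 (each more than 3)
    Returns (z, p).
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 pairs")
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p = 2.0 * NormalDist().cdf(-abs(z))
    return z, p

# Illustrative values only
z, p = compare_correlations(0.5, 103, 0.3, 103)
print(z, p)
```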

Incremental algorithm to compute the correlation coefficient

Define $S_{XY},\; S_{XX},\; \text{and}\; S_{YY}$ as follows: $S_{XY} = \sum_{i=1}^{N} ( x_i - \bar{x} )(y_i - \bar{y}),$ $S_{XX} = \sum_{i=1}^{N} (x_i - \bar{x})^2,$ and $S_{YY} = \sum_{i=1}^{N} (y_i - \bar{y})^2.$ The correlation coefficient can now be written as