Computing a correlation matrix
1. Prepare the Data
Ensure your data is clean and suitable for correlation analysis:
- Collect Numerical Data: Correlation matrices require numerical variables. Ensure your data set contains at least two variables with continuous or ordinal values.
- Handle Missing Values: Remove or impute missing data points to avoid errors in computation.
- Standardize Variables (Optional): For certain correlation methods (e.g., Pearson), standardizing variables to have a mean of 0 and standard deviation of 1 may improve interpretability (a minimal pandas sketch follows this list).
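As an illustration of these preparation steps, the following sketch uses pandas (assumed installed); the DataFrame and its column names are hypothetical.
import pandas as pd

# Hypothetical data set with a missing value
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, None, 180.0, 175.0],
    "weight_kg": [65.0, 59.0, 70.0, 82.0, 74.0],
})

# Handle missing values: drop incomplete rows (or use df.fillna(...) to impute)
df = df.dropna()

# Optional: standardize each column to mean 0 and standard deviation 1
df_std = (df - df.mean()) / df.std()

print(df_std)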
2. Choose a Correlation Method
Select the appropriate correlation coefficient based on your data and analysis goals. See Correlation measurement formulas.
3. Compute the Correlation Matrix
For a data set with $k$ variables, the correlation matrix is a $k \times k$ symmetric matrix where:
- Diagonal elements are 1 (each variable is perfectly correlated with itself).
- Off-diagonal elements represent the correlation coefficient between variables $X_i$ and $X_j$.
Steps:
- Calculate Pairwise Correlations: For each pair of variables $(X_i, X_j)$, compute the correlation coefficient $r_{ij}$ using the chosen method.
- Construct the Matrix: Arrange the coefficients in a matrix where the element at position $(i, j)$ is $r_{ij}$.
- Verify Symmetry: Ensure $r_{ij} = r_{ji}$, as correlation is symmetric (a minimal NumPy sketch follows this list).
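In practice the pairwise loop can be delegated to a library call. A minimal sketch, assuming NumPy is available and the columns of the array are the variables (the data values here are illustrative):
import numpy as np

# Rows are observations, columns are variables
data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])

# rowvar=False tells NumPy that each column is a variable
corr = np.corrcoef(data, rowvar=False)

print(corr)                        # symmetric matrix with ones on the diagonal
print(np.allclose(corr, corr.T))   # True: verifies symmetry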
4. Interpret the Results
- Range: Correlation coefficients range from $-1$ to $+1$.
- $+1$: Perfect positive correlation.
- $0$: No correlation.
- $-1$: Perfect negative correlation.
- Strength: Common thresholds (absolute values; a small helper sketch follows this list):
- $0.0$ to $0.2$: Very weak.
- $0.2$ to $0.4$: Weak.
- $0.4$ to $0.6$: Moderate.
- $0.6$ to $0.8$: Strong.
- $0.8$ to $1.0$: Very strong.
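A hypothetical helper that maps an absolute coefficient to the labels above; the thresholds are conventions, not universal rules, and the function name is illustrative.
def strength_label(r: float) -> str:
    """Classify a correlation coefficient using the common thresholds above."""
    a = abs(r)
    if a < 0.2:
        return "very weak"
    elif a < 0.4:
        return "weak"
    elif a < 0.6:
        return "moderate"
    elif a < 0.8:
        return "strong"
    return "very strong"

print(strength_label(0.85))  # very strong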
5. Optional steps
- Compute p-values to assess whether correlations are significant.
- Use heatmaps, pair plots, or scatter plots to visualize correlations for better interpretation (a sketch covering both optional steps follows this list).
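A minimal sketch of both optional steps, assuming SciPy, seaborn, and matplotlib are installed; it reuses the sample data from the Python example further down this page.
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0],
])

# p-value for one pair of variables (columns 0 and 1)
r, p_value = stats.pearsonr(data[:, 0], data[:, 1])
print(f"r = {r:.4f}, p = {p_value:.4g}")

# Heatmap of the full correlation matrix
corr = np.corrcoef(data, rowvar=False)
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()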
Correlation measurement formulas
Pearson Correlation
Measures linear relationships between continuous variables. Assumes normality.
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Where:
- $r_{xy}$: Pearson correlation coefficient between variables $x$ and $y$.
- $x_i$, $y_i$: $i$-th observations of variables $x$ and $y$.
- $\bar{x}$, $\bar{y}$: Means of variables $x$ and $y$.
- $n$: Number of observations.
Spearman Correlation
Non-parametric, rank-based method for monotonic relationships.
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Where:
- $d_i$: Difference between the ranks of $x_i$ and $y_i$.
- $n$: Number of observations.
This formula assumes no tied ranks; with ties, apply the Pearson formula to the ranks instead.
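A minimal check of the Spearman coefficient, assuming SciPy is available; scipy.stats.spearmanr handles tied values by averaging their ranks.
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]   # monotonic apart from the last point (includes a tie)

rho, p_value = stats.spearmanr(x, y)
print(rho, p_value)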
Kendall’s Tau
Non-parametric method, suitable for small samples or ordinal data.
$$\tau = \frac{2}{n(n-1)} \sum_{i < j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)$$
Where:
- $\operatorname{sgn}$: The sign function, equal to $+1$ for positive arguments, $-1$ for negative arguments, and $0$ for zero.
- $n$: Number of observations.
- $(i, j)$: Index pairs with $1 \le i < j \le n$.
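Similarly, Kendall's tau can be computed with SciPy (a sketch, assuming scipy.stats.kendalltau is available); note that SciPy's default is the tau-b variant, which adjusts for ties.
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]   # ordinal data with some swapped pairs

tau, p_value = stats.kendalltau(x, y)
print(tau, p_value)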
Covariance matrix vs correlation matrix
A correlation matrix and a covariance matrix are related but distinct.
- Covariance Matrix:
- Diagonals: Variances of the variables ($\operatorname{Var}(X_i) = \sigma_i^2$).
- Off-diagonals: Covariances between variables ($\operatorname{Cov}(X_i, X_j)$).
- Correlation Matrix:
- Diagonals: Always 1 (since a variable’s correlation with itself is 1).
- Off-diagonals: Pearson correlation coefficients ($\rho_{ij} = \operatorname{Cov}(X_i, X_j) / (\sigma_i \sigma_j)$).
The two matrices are related by $R = D^{-1} \Sigma D^{-1}$, where $\Sigma$ is the covariance matrix and $D$ is a diagonal matrix with entries $\sigma_i$ (the standard deviations).
The correlation matrix standardizes the covariance matrix by dividing each covariance by the product of the standard deviations, resulting in dimensionless correlation coefficients (ranging from -1 to 1).
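A minimal NumPy sketch of this standardization, converting a covariance matrix into a correlation matrix via $D^{-1} \Sigma D^{-1}$, using the same small data set as the Python example below.
import numpy as np

data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0],
])

# Covariance matrix (columns are variables)
cov = np.cov(data, rowvar=False)

# D^{-1}: reciprocal standard deviations on the diagonal
d_inv = np.diag(1.0 / np.sqrt(np.diag(cov)))

# Correlation matrix R = D^{-1} Sigma D^{-1}
corr = d_inv @ cov @ d_inv

print(np.round(corr, 6))   # matches np.corrcoef(data, rowvar=False)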
Python example
Below is an example of computing a Pearson correlation matrix using a small data set.
import numpy as np

# Sample data: 5 observations (rows) of 3 variables (columns)
data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0]
])

# Initialize matrix
n = data.shape[1]
corr_matrix = np.zeros((n, n))

# Compute Pearson correlations
for i in range(n):
    for j in range(i, n):
        x = data[:, i]
        y = data[:, j]
        x_mean = np.mean(x)
        y_mean = np.mean(y)
        num = np.sum((x - x_mean) * (y - y_mean))
        denom = np.sqrt(np.sum((x - x_mean)**2) * np.sum((y - y_mean)**2))
        r = num / denom if denom != 0 else 0
        corr_matrix[i, j] = r
        corr_matrix[j, i] = r  # Symmetry
    corr_matrix[i, i] = 1  # Diagonal

print("Correlation Matrix:")
print(np.round(corr_matrix, 6))
Output:
Correlation Matrix:
[[ 1. 0.999659 -0.9996 ]
[ 0.999659 1. -0.998657]
[-0.9996 -0.998657 1. ]]
Interpretation
- Variables 1 and 2: Very strong positive correlation ($r \approx 0.9997$).
- Variables 1 and 3: Very strong negative correlation ($r \approx -0.9996$).
- Variables 2 and 3: Very strong negative correlation ($r \approx -0.9987$).
Notes
- Ensure data meets assumptions for the chosen correlation method (e.g., normality for Pearson).
- Large correlation matrices may require visualization tools like heatmaps for clarity.
- Libraries like pandas simplify computation, but verify results for small or complex data sets (see the one-line sketch below).
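For example, with pandas the entire matrix reduces to a single call; this is a sketch assuming the data is already in a DataFrame (the column names are illustrative).
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 4.0, 6.2, 8.1, 10.0],
    "x3": [5.0, 4.0, 3.1, 2.0, 1.0],
})

# Pearson by default; method="spearman" or method="kendall" also work
print(df.corr(method="pearson"))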