Computing correlation matrix

1. Prepare the Data

Ensure your data is clean and suitable for correlation analysis:

  • Collect Numerical Data: Correlation matrices require numerical variables. Ensure your data set contains at least two variables with continuous or ordinal values.
  • Handle Missing Values: Remove or impute missing data points to avoid errors in computation.
  • Standardize Variables (Optional): Standardizing variables to a mean of 0 and a standard deviation of 1 does not change Pearson correlations (they are scale-invariant), but it can make related quantities such as covariances easier to compare. A minimal preparation sketch in pandas follows this list.
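
As a minimal sketch of these preparation steps, assuming the data lives in a hypothetical pandas DataFrame named df with numeric columns:

import pandas as pd

# Hypothetical numeric data with one missing value (illustration only)
df = pd.DataFrame({
    "x": [1.0, 2.0, None, 4.0, 5.0],
    "y": [2.1, 4.0, 6.2, 8.1, 10.0],
})

# Option 1: drop rows that contain missing values
df_clean = df.dropna()

# Option 2: impute missing values with the column mean
df_imputed = df.fillna(df.mean())

# Optional: standardize each column to mean 0 and standard deviation 1
df_std = (df_clean - df_clean.mean()) / df_clean.std()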

2. Choose a Correlation Method

Select the appropriate correlation coefficient based on your data and analysis goals. See Correlation measurement formulas.
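
If the data is in a pandas DataFrame, the method can be selected directly through DataFrame.corr. A short sketch, using a hypothetical DataFrame df with three numeric columns:

import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 4.0, 6.2, 8.1, 10.0],
    "z": [5.0, 4.0, 3.1, 2.0, 1.0],
})

# method can be "pearson" (default), "spearman", or "kendall"
pearson_corr = df.corr(method="pearson")
spearman_corr = df.corr(method="spearman")
kendall_corr = df.corr(method="kendall")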

3. Compute the Correlation Matrix

For a data set with n variables, the correlation matrix is an n × n symmetric matrix where:

  • Diagonal elements are 1 (each variable is perfectly correlated with itself).
  • Off-diagonal elements r_ij represent the correlation coefficient between variables X_i and X_j.

Steps:

  1. Calculate Pairwise Correlations: For each pair of variables (X_i, X_j), compute the correlation coefficient r_ij using the chosen method.
  2. Construct the Matrix: Arrange the coefficients in a matrix where the element at position (i, j) is r_ij.
  3. Verify Symmetry: Ensure r_ij = r_ji, as correlation is symmetric (a quick NumPy cross-check is sketched after this list).
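
As a quick cross-check of these steps, NumPy's corrcoef builds the same symmetric matrix in one call. A sketch, assuming observations are stored as rows and variables as columns:

import numpy as np

# Rows are observations, columns are variables
data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0],
])

# rowvar=False tells NumPy that each column is a variable
corr = np.corrcoef(data, rowvar=False)

print(corr.shape)                 # (3, 3)
print(np.allclose(corr, corr.T))  # True: the matrix is symmetric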

4. Interpret the Results

  • Range: Correlation coefficients range from -1 to +1.
    • +1: Perfect positive correlation.
    • 0: No correlation.
    • -1: Perfect negative correlation.
  • Strength: Common thresholds (absolute values); a helper that maps a coefficient to these labels is sketched after this list:
    • 0.0–0.2: Very weak.
    • 0.2–0.4: Weak.
    • 0.4–0.6: Moderate.
    • 0.6–0.8: Strong.
    • 0.8–1.0: Very strong.
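
A small helper, sketched below, maps a coefficient to these conventional labels (the function name correlation_strength is illustrative, not a library API):

def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to a conventional strength label."""
    a = abs(r)
    if a < 0.2:
        return "very weak"
    elif a < 0.4:
        return "weak"
    elif a < 0.6:
        return "moderate"
    elif a < 0.8:
        return "strong"
    return "very strong"

print(correlation_strength(0.999659))  # very strong
print(correlation_strength(-0.35))     # weak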

5. Optional steps

  • Compute p-values to assess whether correlations are statistically significant.
  • Use heatmaps, pair plots, or scatter plots to visualize correlations for easier interpretation (see the sketch after this list).
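
A sketch of both options, using scipy.stats.pearsonr for p-values and a matplotlib heatmap (seaborn's heatmap function is a common alternative):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0],
])
n_vars = data.shape[1]

# Pairwise Pearson r with p-values
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        print(f"variables {i + 1} and {j + 1}: r = {r:.3f}, p = {p:.4f}")

# Heatmap of the full correlation matrix
corr = np.corrcoef(data, rowvar=False)
plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.colorbar(label="Pearson r")
plt.title("Correlation heatmap")
plt.show()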

Correlation measurement formulas

Pearson Correlation

Measures linear relationships between continuous variables. Significance tests for Pearson's r assume approximately normally distributed variables.

r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

Where:

  • r_{xy}: Pearson correlation coefficient between variables x and y.
  • x_i, y_i: i-th observations of variables x and y.
  • \bar{x}, \bar{y}: Means of variables x and y.
  • n: Number of observations.

Spearman Correlation

Non-parametric, rank-based method for monotonic relationships.

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

(This simplified form assumes no tied ranks.)

Where:

  • d_i: Difference between the ranks of x_i and y_i.
  • n: Number of observations.
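
A sketch using scipy.stats.spearmanr, which also returns a p-value; computing Pearson correlation on the ranks gives the same result:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 10.0])

rho, p_value = stats.spearmanr(x, y)
print(rho)  # 1.0: the relationship is perfectly monotonic

# Equivalent: Pearson correlation of the ranks
rx, ry = stats.rankdata(x), stats.rankdata(y)
print(np.corrcoef(rx, ry)[0, 1])  # also 1.0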

Kendall’s Tau

Non-parametric method, suitable for small samples or ordinal data.

\tau = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)

Where:

  • sgn: The sign function, equal to +1, 0, or -1 according to the sign of its argument.
  • n: Number of observations.
  • i, j: Indices of observation pairs with 1 ≤ i < j ≤ n.
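
A sketch using scipy.stats.kendalltau, which by default computes the tau-b variant (adjusted for ties):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 4.0, 3.1, 2.0, 1.0])

tau, p_value = stats.kendalltau(x, y)
print(tau)  # -1.0: every pair of observations is discordant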

Covariance matrix vs correlation matrix

A correlation matrix and a covariance matrix are related but distinct.

  • Covariance Matrix:
    • Diagonals: Variances of the variables (Var(X_i) = σ_i²).
    • Off-diagonals: Covariances between variables (Cov(X_i, X_j)).
  • Correlation Matrix:
    • Diagonals: Always 1 (since a variable’s correlation with itself is 1).
    • Off-diagonals: Pearson correlation coefficients (r_ij = Cov(X_i, X_j) / (σ_i σ_j)).
      In matrix form, R = D^{-1} Σ D^{-1}, where Σ is the covariance matrix and D is a diagonal matrix with entries σ_i (the standard deviations).

The correlation matrix standardizes the covariance matrix by dividing each covariance by the product of the standard deviations, resulting in dimensionless correlation coefficients (ranging from -1 to 1).
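
This relationship can be verified numerically. The sketch below rescales a covariance matrix from np.cov into the correlation matrix and compares it with np.corrcoef:

import numpy as np

data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0],
])

# Covariance matrix (variables in columns)
cov = np.cov(data, rowvar=False)

# Standard deviations are the square roots of the diagonal variances
std = np.sqrt(np.diag(cov))

# Divide each covariance by the product of the standard deviations
corr = cov / np.outer(std, std)

print(np.allclose(corr, np.corrcoef(data, rowvar=False)))  # True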

Python example

Below is an example of computing a Pearson correlation matrix using a small data set.

import numpy as np
 
# Sample data
data = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.2, 3.1],
    [4.0, 8.1, 2.0],
    [5.0, 10.0, 1.0]
])
 
# Initialize matrix
n = data.shape[1]
corr_matrix = np.zeros((n, n))
 
# Compute Pearson correlations
for i in range(n):
    for j in range(i, n):
        x = data[:, i]
        y = data[:, j]
        x_mean = np.mean(x)
        y_mean = np.mean(y)
        num = np.sum((x - x_mean) * (y - y_mean))
        denom = np.sqrt(np.sum((x - x_mean)**2) * np.sum((y - y_mean)**2))
        r = num / denom if denom != 0 else 0
        corr_matrix[i, j] = r
        corr_matrix[j, i] = r  # Symmetry
    corr_matrix[i, i] = 1  # Diagonal
 
print("Correlation Matrix:")
print(np.round(corr_matrix, 6))

Output:

Correlation Matrix:
[[ 1.        0.999659 -0.9996  ]
 [ 0.999659  1.       -0.998657]
 [-0.9996   -0.998657  1.      ]]

Interpretation

  • Variables 1 and 2: Very strong positive (r ≈ 1.000).
  • Variables 1 and 3: Very strong negative (r ≈ -1.000).
  • Variables 2 and 3: Very strong negative (r ≈ -0.999).

Notes

  • Ensure data meets assumptions for the chosen correlation method (e.g., normality for Pearson).
  • Large correlation matrices may require visualization tools like heatmaps for clarity.
  • Libraries like pandas simplify computation but verify results for small or complex data sets.