About multivariate regression

Multivariate regression models the linear relationship between multiple independent variables ($X$) and one or more dependent variables ($y$).

The goal is to predict $y$ using a linear combination of the predictors and to understand their collective impact, for example predicting house prices from features such as size and age.

Computing regression coefficients

Steps to compute regression coefficients:

  1. Prepare the data matrix $X$, adding a column of ones for the intercept.
  2. Calculate the coefficients by least squares using Formula 10.5: $\hat{\beta} = (X^\top X)^{-1} X^\top y$, where $X^\top$ is the transpose of $X$, and $(X^\top X)^{-1}$ is the inverse of $X^\top X$.
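
As a minimal sketch of the two steps above, the following uses NumPy on a small made-up dataset (the numbers and the names `X_raw`, `y`, `beta_hat`, and `beta_lstsq` are illustrative assumptions, not part of the original notes). It solves the normal equations with `np.linalg.solve`, which is algebraically equivalent to Formula 10.5 but avoids forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical toy data: 6 observations, 2 predictors (values are made up).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 6.0],
                  [6.0, 5.0]])
# Response generated exactly as y = 1 + 2*x1 + 3*x2 (no noise),
# so the fit should recover coefficients close to (1, 2, 3).
y = 1.0 + 2.0 * X_raw[:, 0] + 3.0 * X_raw[:, 1]

# Step 1: add a column of ones for the intercept.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Step 2: solve (X'X) beta = X'y, equivalent to beta = (X'X)^{-1} X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(beta_hat, 6))
print("lstsq:           ", np.round(beta_lstsq, 6))
```

Because the toy response contains no noise, both solvers should agree with the generating coefficients up to floating-point error.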

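The model should look like:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_q x_q$

Coefficients $\hat{\beta}_1$ to $\hat{\beta}_q$ indicate each predictor’s effect on $y$, holding the others constant. A positive coefficient means $y$ increases as the predictor increases; a negative one means it decreases.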

Computing predicted values

Use the linear equation Formula 10.3: $\hat{y} = X\hat{\beta}$.

$\hat{y}$ is the vector of predicted values (a matrix when there are several dependent variables).

Evaluating model performance

  1. Compute the Sum of Squared Errors (SSE) using Formula 10.6.
  2. Estimate the error variance $s^2$ with Formula 10.8, where $n$ is the number of observations and $q$ is the number of predictors.
  3. Calculate R-squared using Formula 10.30 to measure the proportion of variance explained.
  4. Derive the Root Mean Squared Error (RMSE) as $\sqrt{\mathrm{SSE}/n}$.

R-squared is the proportion of variance in $y$ explained by the model, while RMSE measures the typical prediction error in the units of $y$.

A high RMSE or a low R-squared suggests a problem with the data or the model.
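
The four evaluation steps can be sketched as a single helper. The function name `regression_metrics` and the toy numbers are assumptions for illustration only. The sketch fits by least squares and then computes SSE as the sum of squared residuals, $s^2 = \mathrm{SSE}/(n - q - 1)$, R-squared as the regression sum of squares over the total sum of squares, and RMSE as $\sqrt{\mathrm{SSE}/n}$, all on the data the model was fitted to.

```python
import numpy as np

def regression_metrics(X, y):
    """Fit least squares on (X, y) and return SSE, s^2, R^2 and RMSE.

    X must already contain the leading column of ones.
    """
    n, p = X.shape
    q = p - 1                                        # predictors, excluding the intercept
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ beta_hat
    sse = float(residuals @ residuals)               # sum of squared errors
    s2 = sse / (n - q - 1)                           # error variance estimate
    y_bar = y.mean()
    ssr = float(beta_hat @ X.T @ y) - n * y_bar**2   # regression sum of squares
    ssto = float(y @ y) - n * y_bar**2               # total sum of squares
    r2 = ssr / ssto
    rmse = float(np.sqrt(sse / n))
    return {"SSE": sse, "s2": s2, "R2": r2, "RMSE": rmse}

# Hypothetical toy data: one predictor, five observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])

metrics = regression_metrics(X, y)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this toy data the fitted line is $\hat{y} = 2.2 + 0.6x$, which gives SSE = 2.4, $s^2$ = 0.8, and R-squared = 0.6 when worked by hand.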

Assumptions

  • Linearity between $X$ and $y$.
  • Errors are independent and identically distributed, with zero mean and constant variance.
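
One rough way to probe these assumptions is to inspect the residuals of a fitted model. The sketch below uses simulated data (all names and numbers are illustrative assumptions) and two informal checks: the residual mean, and the ratio of residual variances between the low and high halves of the fitted values. It is not a formal test; a ratio far from 1 merely hints at non-constant variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data that satisfies the assumptions:
# a linear signal plus independent noise with constant variance.
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted

# With an intercept in the model, least-squares residuals
# average to zero up to floating-point error.
print(f"mean residual: {residuals.mean():.2e}")

# Informal constant-variance check: compare the residual spread
# for observations with low versus high fitted values.
low = fitted <= np.median(fitted)
ratio = residuals[low].var(ddof=1) / residuals[~low].var(ddof=1)
print(f"variance ratio (low/high fitted): {ratio:.2f}")
```

A plot of residuals against fitted values serves the same purpose visually: curvature suggests a violation of linearity, and a funnel shape suggests non-constant variance.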

Core formulas

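  • Formula 10.3 (Predicted values): $\hat{y} = X\hat{\beta}$
  • Formula 10.5 (Coefficient estimation): $\hat{\beta} = (X^\top X)^{-1} X^\top y$
  • Formula 10.6 (Sum of squared errors): $\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (y - X\hat{\beta})^\top (y - X\hat{\beta})$
  • Formula 10.8 (Variance estimate): $s^2 = \mathrm{SSE} / (n - q - 1)$
  • Formula 10.30 (R-squared): $R^2 = (\hat{\beta}^\top X^\top y - n\bar{y}^2) / (y^\top y - n\bar{y}^2)$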
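
For reference, the coefficient estimate in Formula 10.5 can be obtained by minimizing the sum of squared errors. A brief sketch of that standard derivation, with $X$ the data matrix (including the column of ones) and $y$ the response vector, and assuming $X^\top X$ is invertible:

```latex
\begin{aligned}
\mathrm{SSE}(\beta) &= (y - X\beta)^\top (y - X\beta)
  = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta, \\
\frac{\partial\,\mathrm{SSE}}{\partial \beta} &= -2X^\top y + 2X^\top X \beta = 0
  \;\Longrightarrow\; X^\top X \hat{\beta} = X^\top y
  \;\Longrightarrow\; \hat{\beta} = (X^\top X)^{-1} X^\top y .
\end{aligned}
```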

Limitations

  • Assumes linear relationships.
  • Sensitive to multicollinearity.
  • Needs more observations than coefficients ($n > q + 1$) for stable inversion of $X^\top X$.
  • Outliers or non-normal errors can skew results.
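
To illustrate the multicollinearity and inversion points, the sketch below (simulated data; the variable names and the near-duplicate predictor are assumptions for illustration) compares the condition number of $X^\top X$ with and without a nearly collinear column. A very large condition number signals that inverting $X^\top X$ is numerically unstable. `np.linalg.lstsq`, which works on $X$ directly, is one common alternative to explicit inversion, though it cannot remove the underlying ambiguity between collinear predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x2_collinear = x1 + rng.normal(0, 1e-6, n)   # almost an exact copy of x1

X_ok = np.column_stack([np.ones(n), x1, x2])
X_bad = np.column_stack([np.ones(n), x1, x2_collinear])

# Condition number of X'X: large values mean unstable inversion.
cond_ok = np.linalg.cond(X_ok.T @ X_ok)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)
print(f"condition number, independent predictors:    {cond_ok:.1e}")
print(f"condition number, near-collinear predictors: {cond_bad:.1e}")

# The true effect of x1 is 2. With the near-duplicate column present,
# the two individual coefficients are poorly determined,
# but their sum is still estimated well.
y = 1.0 + 2.0 * x1 + rng.normal(0, 0.1, n)
beta_hat, *_ = np.linalg.lstsq(X_bad, y, rcond=None)
print("individual coefficients:", beta_hat[1], beta_hat[2])
print("sum of the two:", beta_hat[1] + beta_hat[2])
```

In practice, dropping or combining redundant predictors addresses the cause rather than the symptom.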

Python example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
# 1. Create a synthetic dataset
np.random.seed(42)
n_samples = 100
data = {
    'Luas_Tanah': np.random.normal(200, 50, n_samples),
    'Jumlah_Kamar': np.random.randint(2, 6, n_samples),
    'Usia_Rumah': np.random.randint(1, 30, n_samples),
    'Harga_Rumah': np.zeros(n_samples)
}
 
for i in range(n_samples):
    data['Harga_Rumah'][i] = (data['Luas_Tanah'][i] * 2.5 + 
                             data['Jumlah_Kamar'][i] * 50 + 
                             data['Usia_Rumah'][i] * -10 + 
                             np.random.normal(0, 50))
 
df = pd.DataFrame(data)
 
# 2. Preprocessing
X = df[['Luas_Tanah', 'Jumlah_Kamar', 'Usia_Rumah']].values
y = df['Harga_Rumah'].values
X = np.hstack([np.ones((n_samples, 1)), X])
 
# Split the data into training (80%) and test (20%) sets
np.random.seed(42)
indices = np.random.permutation(n_samples)
train_size = int(0.8 * n_samples)
train_indices = indices[:train_size]
test_indices = indices[train_size:]
 
X_train = X[train_indices]
X_test = X[test_indices]
y_train = y[train_indices]
y_test = y[test_indices]
 
# 3. Compute the regression coefficients (Formula 10.5)
X_train_T = X_train.T
XtX = np.dot(X_train_T, X_train)
XtX_inv = np.linalg.inv(XtX)
Xty = np.dot(X_train_T, y_train)
beta_hat = np.dot(XtX_inv, Xty)
 
# 4. Predict (Formula 10.3)
y_pred_train = np.dot(X_train, beta_hat)
y_pred_test = np.dot(X_test, beta_hat)
 
# 5. Compute the evaluation metrics
# SSE (Formula 10.6): training residuals for s², test residuals for RMSE
SSE_train = np.sum((y_train - y_pred_train) ** 2)
SSE = np.sum((y_test - y_pred_test) ** 2)
 
# s² (Formula 10.8): estimated from the data the model was fitted on
n_train, q = X_train.shape[0], X_train.shape[1] - 1
s_squared = SSE_train / (n_train - q - 1)
 
# R² (Formula 10.30), computed on the training set
y_mean = np.mean(y_train)
SSR = np.dot(beta_hat.T, np.dot(X_train_T, y_train)) - n_train * y_mean**2
SSTO = np.sum((y_train - y_mean)**2)
R_squared = SSR / SSTO
 
# RMSE on the held-out test set
n = X_test.shape[0]
RMSE = np.sqrt(SSE / n)
 
# 6. Display the results
print("Regression coefficients (β̂):")
print(f"Intercept (β₀): {beta_hat[0]:.2f}")
for i, feature in enumerate(['Luas_Tanah', 'Jumlah_Kamar', 'Usia_Rumah']):
    print(f"{feature} (β{i+1}): {beta_hat[i+1]:.2f}")
print("\nEvaluation metrics:")
print(f"R-squared (R²), training set: {R_squared:.2f}")
print(f"Root Mean Squared Error (RMSE), test set: {RMSE:.2f}")
 
# 7. Visualize the results: actual vs predicted prices on the test set
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_test, color='blue', alpha=0.5)
# Dashed diagonal marks perfect predictions (predicted == actual)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('Actual vs Predicted House Prices (Manual Calculation)')
plt.tight_layout()
plt.savefig('house_price_prediction_manual_corrected.png')
plt.close()
 
# 8. Predict for new data (intercept term, Luas_Tanah, Jumlah_Kamar, Usia_Rumah)
new_house = np.array([1, 250, 4, 5])
predicted_price = np.dot(new_house, beta_hat)
print(f"\nPredicted price for the new house: ${predicted_price:.2f}")