In general, linear regression is a linear approach to modelling the relationship between a numerical response (dependent) variable and one or more explanatory (independent) variables.
Theories
Model Assumptions
Let:
- $y_i$ be the measured response for the $i$th subject, $i = 1, \ldots, n$.
- $x_{i0} = 1, x_{i1}, \ldots, x_{ip}$ be the values of the explanatory variables for the $i$th subject. The first entry, $x_{i0} = 1$, is used for the intercept term.
Using $'$ as the transpose operator, the linear regression model assumes that:
$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where:
- $\mathbf{y} = (y_1, \ldots, y_n)'$ is the response vector.
- $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the unknown regression coefficient vector.
- The design matrix $X$ has $\mathbf{x}_i' = (x_{i0}, x_{i1}, \ldots, x_{ip})$ as its $i$th row.
- $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)'$ is a random vector.
- $\varepsilon_i$, $i = 1, \ldots, n$, are i.i.d. error terms from an unknown distribution with mean $0$ and unknown fixed variance $\sigma^2$.
- Due to the i.i.d. assumption, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$.
- Thus, the diagonal elements of $\operatorname{Var}(\boldsymbol{\varepsilon})$ are equal to $\sigma^2$ and the off-diagonal elements are zero.
- We can write $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix, with ones on its diagonal and zeros everywhere else.
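Putting the error assumptions together in one line, along with their immediate consequence for $\mathbf{y}$ (for a fixed design matrix $X$):
$$\operatorname{E}[\boldsymbol{\varepsilon}] = \mathbf{0}, \qquad \operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I_n, \qquad \text{so that} \qquad \operatorname{E}[\mathbf{y}] = X\boldsymbol{\beta}, \qquad \operatorname{Var}(\mathbf{y}) = \sigma^2 I_n.$$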
Note that no assumptions are imposed on the design matrix $X$: the values can be chosen by the experimenter, or they can be randomly realised from any distribution.
Ordinary Least Squares (OLS) Estimator of $\boldsymbol{\beta}$
Assuming that $(X'X)^{-1}$ exists, that is, $X'X$ is invertible, the ordinary least squares estimator of $\boldsymbol{\beta}$ is defined by minimizing the error terms $\varepsilon_i = y_i - \mathbf{x}_i'\boldsymbol{\beta}$. Specifically, we minimize the sum of squared errors:
$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\; (\mathbf{y} - X\boldsymbol{\beta})'(\mathbf{y} - X\boldsymbol{\beta}).$$
It can be shown that the solution is $\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}$, given that $(X'X)^{-1}$ exists.
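As a quick sketch of why this is the solution: differentiating the sum of squared errors with respect to $\boldsymbol{\beta}$ and setting the gradient to zero yields the normal equations,
$$-2X'\mathbf{y} + 2X'X\boldsymbol{\beta} = \mathbf{0} \;\;\Longrightarrow\;\; X'X\,\hat{\boldsymbol{\beta}} = X'\mathbf{y} \;\;\Longrightarrow\;\; \hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}.$$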
Estimator of $\sigma^2$
An unbiased estimator of $\sigma^2$ is
$$\hat{\sigma}^2 = \frac{1}{n - p}\,(\mathbf{y} - X\hat{\boldsymbol{\beta}})'(\mathbf{y} - X\hat{\boldsymbol{\beta}}) = \frac{1}{n - p}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}\right)^2,$$
where $p$ is the number of columns of $X$.
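The division by $n - p$ rather than $n$ compensates for the $p$ coefficients estimated from the same data; under the assumptions above it can be shown that
$$\operatorname{E}\!\left[(\mathbf{y} - X\hat{\boldsymbol{\beta}})'(\mathbf{y} - X\hat{\boldsymbol{\beta}})\right] = (n - p)\,\sigma^2,$$
so the scaled residual sum of squares has expectation exactly $\sigma^2$.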
Simulation
Let’s simulate the following linear model: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$, $x_{i1} \sim \mathrm{Uniform}(0, 10)$, and $x_{i2} \sim N(4, 10^2)$.
First, create the vector $\mathbf{y}$. The parameters are set as $\beta_0 = 10$; $\beta_1 = 2$; $\beta_2 = 0.5$, with $n = 1000$ and $\sigma = 4$.
set.seed(001)
n = 1000; p = 2; sigma = 4;
# Betas
b0 = 10
b1 = 2
b2 = 0.5
beta = c(b0, b1,b2)
# Xs
x1 = runif(n, 0, 10)
x2 = rnorm(n, mean = 4, sd = 10)
xmat = cbind(rep(1,n), x1, x2)
# errors
eps = rnorm(n,0,sigma)
# calculate Y using linear model
y = xmat %*% beta + eps
#bind data
df_lm = as.data.frame(cbind(y,xmat))
colnames(df_lm) = c("y", "intercept", "x1", "x2")
head(df_lm)
## y intercept x1 x2
## 1 21.09686 1 2.655087 4.773031
## 2 14.25688 1 3.721239 1.031314
## 3 21.11518 1 5.728534 -7.832422
## 4 26.45658 1 9.082078 4.112927
## 5 23.14745 1 2.016819 13.916010
## 6 37.20973 1 8.983897 19.939675
Visualize data points
Let’s visualize the data.
library(plotly)
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d",
        mode = "markers",
        marker = list(color = pcolor,
                      opacity = 0.4))
Figure 1: Data points generated from an underlying linear model.
Estimating $\boldsymbol{\beta}$ and $\sigma^2$
The $\beta$s can be calculated in two ways: using the closed-form formula and using an R function.
# Using formula
beta_form = qr.solve(crossprod(xmat)) %*% crossprod(xmat,y)
# Using R function
beta_r = lm.fit(x = xmat, y = y)$coefficients
compare = cbind(beta_form, beta_r, beta)
colnames(compare)= c("formula", "r_function", "actual")
compare
## formula r_function actual
## 9.8791689 9.8791689 10.0
## x1 2.0064254 2.0064254 2.0
## x2 0.5127404 0.5127404 0.5
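As a side note on the formula-based approach: `qr.solve(crossprod(xmat))` explicitly inverts $X'X$. A common alternative, shown as a minimal sketch below (the name `beta_solve` is just illustrative), is to solve the normal equations $X'X\,\boldsymbol{\beta} = X'\mathbf{y}$ directly, which avoids forming the inverse and gives the same estimates up to numerical precision.
# solve the normal equations X'X b = X'y without explicitly inverting X'X
beta_solve = solve(crossprod(xmat), crossprod(xmat, y))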
The calculated (estimated) values for the $\beta$s are similar to the ones that were used to generate the model, that is, 10, 2, and 0.5 for $\beta_0$, $\beta_1$, and $\beta_2$ respectively.
The realization of $\hat{\sigma}^2$ is:
# estimated error variance: residual sum of squares divided by (n - p)
eps_hat = 1/(n - p)*sum((y - xmat %*% beta_form)^2)
eps_hat
## [1] 17.76159
The original $\sigma^2$ is $4^2 = 16$ and the realized value is 17.7615874.
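As a quick cross-check, the same quantity can also be obtained from R's built-in `lm()` (a minimal sketch on the same simulated data; `fit_lm` is just an illustrative name). Note that `lm()` divides the residual sum of squares by the residual degrees of freedom, which is $n - 3$ here rather than the $n - p = 998$ used above, so the two values will differ very slightly.
# fit the same model with lm() and square the residual standard error
fit_lm = lm(y ~ x1 + x2, data = df_lm)
sigma(fit_lm)^2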
Visualize fitted plane
So what is the fitted plane on the data points? Let’s visualize it.
library(reshape2)
# predict over sensible grid of values
x1s <- seq(min(x1), max(x1), length.out = 100)
x2s <- seq(min(x2), max(x2), length.out = 100)
surface <- expand.grid(x1 = x1s, x2 = x2s, KEEP.OUT.ATTRS = FALSE)
# Making prediction
surface$y <- apply(surface[,c("x1", "x2")], 1, function(x) cbind(1,x[1], x[2]) %*% beta_form)
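# Alternative sketch (not run): the per-row apply() above can be replaced by a
# single vectorized matrix product over the whole grid, e.g.
#   surface$y <- as.vector(cbind(1, surface$x1, surface$x2) %*% beta_form)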
surface <- acast(surface, x2 ~ x1, value.var = "y")
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d",
        mode = "markers",
        marker = list(color = pcolor,
                      opacity = 0.7)) %>%
  add_trace(z = surface,
            x = x1s,
            y = x2s,
            type = "surface",
            opacity = 0.6)
Figure 2: Fitted plane on the data cloud.
We can see that the plane fitted from $\hat{\boldsymbol{\beta}}$ follows the data points pretty closely. However, note that linear models are very rigid: the fitted surface is always a plane.