Linear Regression Simulation Study

In general, linear regression is a linear approach to modelling the relationship between a numerical response (dependent) variable and one or more explanatory (independent) variables.

Theory

Model Assumptions

Let:

  • $y_i \in \mathbb{R}$ be the measured response for the $i$th subject.
  • $x_i = (1, x_{i1}, \dots, x_{ip})^\top \in \mathbb{R}^{p+1}$, $i = 1, 2, \dots, n$, be the values of the explanatory variables for the $i$th subject. The first entry is used for the intercept term.

Using $\top$ as the transpose operator, the linear regression model assumes that:

$$Y_i = x_i^\top \beta + \epsilon_i, \quad i = 1, \dots, n, \qquad \text{or in matrix form,} \qquad Y = X\beta + \epsilon$$

where:

  • $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top \in \mathbb{R}^{p+1}$ is the unknown regression coefficient vector.
  • The design matrix $X \in \mathbb{R}^{n \times (p+1)}$ has $x_i^\top$ as its $i$th row.
  • $Y = (Y_1, \dots, Y_n)^\top$ is a random vector.
  • $\epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)^\top$ is a vector of i.i.d. error terms from an unknown distribution with mean $0$ and unknown fixed variance $\sigma^2$.
    • $\operatorname{cov}(\epsilon_i, \epsilon_i) = \operatorname{var}(\epsilon_i) = \sigma^2$
    • Due to the i.i.d. assumption, $\operatorname{cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$.
    • Thus, the diagonal elements of $\operatorname{var}(\epsilon)$ are equal to $\sigma^2$ and the off-diagonal elements are zero: $$\operatorname{var}(\epsilon) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$
    • We can write $\operatorname{var}(\epsilon) = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix, with ones on its diagonal and zeros everywhere else.
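As a quick empirical check of this covariance structure, here is a minimal R sketch: simulate many i.i.d. error vectors and inspect their sample covariance matrix, which should be close to $\sigma^2 I_n$. The seed, the small dimension, the value of $\sigma$, and the number of replications below are all arbitrary illustrative choices.

# Minimal sketch: empirical check that var(eps) is approximately sigma^2 * I_n.
# n_chk, sigma_chk, the seed, and 1e5 replications are illustrative choices.
set.seed(123)
n_chk = 4; sigma_chk = 2
eps_draws = replicate(1e5, rnorm(n_chk, mean = 0, sd = sigma_chk))
round(cov(t(eps_draws)), 1)  # ~4 (= sigma_chk^2) on the diagonal, ~0 elsewhere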

Note that no assumptions are imposed on the design matrix $X$: the values can be chosen by the experimenter, or they can be randomly realized from any distribution.

Ordinary Least Squares (OLS) Estimator of $\beta$

Assuming that $(X^\top X)^{-1}$ exists, that is, $X^\top X$ is invertible, the ordinary least squares estimator of $\beta$ is defined by minimizing the error terms $\epsilon = Y - Xb$. Specifically, we minimize the sum of squared errors:

$$\hat{\beta} = \underset{b \in \mathbb{R}^{p+1}}{\operatorname{arg\,min}} \; (Y - Xb)^\top (Y - Xb)$$

It can be shown that the solution is $\hat{\beta} = (X^\top X)^{-1} X^\top Y$, given that $(X^\top X)^{-1}$ exists.
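As a brief sketch of why this holds: expanding the objective and setting its gradient with respect to $b$ to zero yields the normal equations,

$$\frac{\partial}{\partial b} (Y - Xb)^\top (Y - Xb) = -2X^\top Y + 2X^\top X b = 0 \quad \Longrightarrow \quad X^\top X \hat{\beta} = X^\top Y,$$

and left-multiplying both sides by $(X^\top X)^{-1}$ gives the closed-form solution above.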

Estimator of $\sigma^2$

An unbiased estimator of $\sigma^2$ is $$\hat{\sigma}^2_\epsilon = \frac{1}{n - p - 1} \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \frac{1}{n - p - 1} (Y - X\hat{\beta})^\top (Y - X\hat{\beta}),$$ where $n - p - 1 = n - (p + 1)$ is the residual degrees of freedom: the sample size minus the number of estimated coefficients.

Simulation

Let’s simulate the following linear model: $$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$ where $i = 1, 2, \dots, 1000$, $\epsilon_i \sim N(0, \sigma^2)$, and $\sigma^2 = 16$.

First, create the vector of $Y$. The parameters are set as $\beta = (10, 2, 0.5)^\top$, $n = 1000$, and $\sigma = 4$.

set.seed(001)
n = 1000; p = 2; sigma = 4;

# Betas
b0 = 10
b1 = 2
b2 = 0.5
beta = c(b0, b1, b2)

# Xs
x1 = runif(n, 0, 10)
x2 = rnorm(n, mean = 4, sd = 10)
xmat = cbind(rep(1,n), x1, x2)

# errors
eps = rnorm(n,0,sigma)

# calculate Y using linear model
y = xmat %*% beta + eps

#bind data
df_lm = as.data.frame(cbind(y,xmat))
colnames(df_lm) = c("y", "intercept", "x1", "x2")
head(df_lm)
##          y intercept       x1        x2
## 1 21.09686         1 2.655087  4.773031
## 2 14.25688         1 3.721239  1.031314
## 3 21.11518         1 5.728534 -7.832422
## 4 26.45658         1 9.082078  4.112927
## 5 23.14745         1 2.016819 13.916010
## 6 37.20973         1 8.983897 19.939675

Visualize data points

Let’s visualize the data.

library(plotly)
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d", 
        mode = "markers",
        marker = list(color = pcolor, 
                      opacity = 0.4))

Figure 1: Data points generated from an underlying linear model.

Estimating $\beta$ and $\sigma^2$

The $\beta$s can be estimated in two ways: using the closed-form formula or using a built-in R function.

# Using the closed-form formula: (X'X)^{-1} X'Y
beta_form = qr.solve(crossprod(xmat)) %*% crossprod(xmat, y)
# Using the built-in R function
beta_r = lm.fit(x = xmat, y = y)$coefficients
compare = cbind(beta_form, beta_r, beta)
colnames(compare) = c("formula", "r_function", "actual")
compare
##      formula r_function actual
##    9.8791689  9.8791689   10.0
## x1 2.0064254  2.0064254    2.0
## x2 0.5127404  0.5127404    0.5

The calculated (estimated) values of the $\beta$s are close to the ones used to generate the model, namely 10, 2, and 0.5 for $\beta_0$, $\beta_1$, and $\beta_2$, respectively.
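To quantify how close the estimates are, one could also inspect their standard errors. Here is a minimal sketch using base R's lm() on the simulated data frame; the object name fit is an arbitrary choice, and since the formula interface adds its own intercept, the explicit intercept column in df_lm is simply not used.

# Sketch: refit with lm() to get standard errors alongside the point estimates
fit = lm(y ~ x1 + x2, data = df_lm)
summary(fit)$coefficients  # estimates, standard errors, t-values, p-values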

The realization of $\hat{\sigma}^2_\epsilon$ is:

# Unbiased estimate of sigma^2: RSS over residual degrees of freedom, n - (p + 1)
sigma2_hat = 1/(n - p - 1)*sum((y - xmat %*% beta_form)^2)
sigma2_hat
## [1] 17.7794

The original $\sigma^2$ is 16 and the realized value is approximately 17.78.
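As a further check that the estimator is unbiased, here is a minimal sketch that regenerates the response many times and averages the resulting variance estimates; the seed and the 500 replications are arbitrary illustrative choices, and the average should land close to the true $\sigma^2 = 16$.

# Sketch: Monte Carlo check of the unbiasedness of the variance estimator.
# Reuses xmat, beta, n, p, and sigma from above; 500 replications is arbitrary.
set.seed(2)
sigma2_reps = replicate(500, {
  y_rep = xmat %*% beta + rnorm(n, 0, sigma)
  b_rep = qr.solve(crossprod(xmat)) %*% crossprod(xmat, y_rep)
  1/(n - p - 1)*sum((y_rep - xmat %*% b_rep)^2)
})
mean(sigma2_reps)  # should be close to sigma^2 = 16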

Visualize fitted plane

So what does the fitted plane look like on the data points? Let’s visualize it.

library(reshape2)
# predict over a sensible grid of values
x1s <- seq(min(x1), max(x1), length.out = 100)
x2s <- seq(min(x2), max(x2), length.out = 100)
surface <- expand.grid(x1 = x1s, x2 = x2s, KEEP.OUT.ATTRS = FALSE)

# Making predictions over the grid (vectorised matrix multiplication)
surface$y <- as.vector(cbind(1, surface$x1, surface$x2) %*% beta_form)
# Cast the long grid to a wide matrix: plotly's surface trace expects z with
# rows along the y-axis (x2) and columns along the x-axis (x1)
surface <- acast(surface, x2 ~ x1, value.var = "y")

pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d", 
        mode = "markers",
        marker = list(color = pcolor, 
                      opacity = 0.7)) %>%
  add_trace(z = surface,
            x = x1s,
            y = x2s,
            type = "surface",
            opacity = 0.6)

Figure 2: Fitted plane on the data cloud.

We can see that the plane fitted from $\hat{\beta}$ follows the data points quite closely. Note, however, that linear models are very rigid: the fitted surface is always a plane.
