In general, linear regression is a linear approach to modelling the relationship between a numerical response (dependent) variable and one or more explanatory (independent) variables.
Theories
Model Assumptions
Let:
- \(y_i \in \mathbb{R}\) be the measured response for the \(i\)th subject.
- \(x_i = (1, x_{i1}, ..., x_{ip}) \in \mathbb{R}^{p+1}\), \(i = 1, 2, ..., n\), be the values of the explanatory variables for the \(i\)th subject. The first entry is used for the intercept term.
Using \('\) to denote the transpose operator, the linear regression model assumes that:
\[\begin{align*} Y_{i} &= x_i'\beta + \epsilon_i,\ i = 1, ..., n\\ Y &= X \beta + \epsilon \end{align*}\]
where:
- \(\beta = (\beta_0, \beta_1, ..., \beta_p)' \in \mathbb{R}^{p+1}\) is the unknown regression coefficient vector.
- The design matrix \(X \in \mathbb{R}^{n\times (p+1)}\) has \(x_i'\) as its \(i\)th row.
- \(Y = (Y_1, ..., Y_n)'\) is a random vector.
- \(\epsilon = (\epsilon_1, \epsilon_2, ..., \epsilon_n)'\), where the \(\epsilon_i \stackrel{i.i.d.}{\sim} (0, \sigma^2)\) are i.i.d. error terms from an unknown distribution with mean 0 and unknown fixed variance \(\sigma^2\).
- \(cov(\epsilon_i, \epsilon_i) = var (\epsilon_i) = \sigma^2\)
- Due to the i.i.d. assumption, \(cov(\epsilon_i,\ \epsilon_j) = 0,\ i \neq j\).
- Thus, the diagonal elements of \(var(\epsilon)\) are equal to \(\sigma^2\) and the off-diagonal elements are zero. \[var(\epsilon) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}\]
- We can write \(var(\epsilon) = \sigma^2 I_n\), where \(I_n\) is the \(n \times n\) identity matrix, with ones on its diagonal and zeros everywhere else.
Note that no assumptions are imposed on the design matrix \(X\): its values can be chosen by the experimenter or be randomly realised from any distribution.
Ordinary Least Squares (OLS) Estimator of \(\beta\)
Assuming that \((X'X)^{-1}\) exists, that is, \((X'X)\) is invertible, the ordinary least squares estimator of \(\beta\) is defined by minimizing the error terms \(\epsilon = Y - Xb\). Specifically, we minimize the sum of squared errors:
\[\hat\beta = \underset{b\ \in\ \mathbb{R}^{p+1}}{\mathrm{argmin}}\ (Y-Xb)'\ (Y- Xb)\]
It can be shown that the solution is \[\hat \beta = (X'X)^{-1} X'Y\] given that \((X'X)^{-1}\) exists.
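As a quick sketch of why this is the minimizer: expanding the objective and setting its derivative with respect to \(b\) to zero yields the normal equations, whose solution is the expression above.
\[\begin{align*} (Y-Xb)'(Y-Xb) &= Y'Y - 2b'X'Y + b'X'Xb\\ \frac{\partial}{\partial b}\left[(Y-Xb)'(Y-Xb)\right] &= -2X'Y + 2X'Xb = 0\\ \Rightarrow\ X'X\hat\beta &= X'Y \quad\Rightarrow\quad \hat\beta = (X'X)^{-1}X'Y \end{align*}\]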
Estimator of \(\sigma^2\)
An unbiased estimator of \(\sigma^2\) is \[\begin{align*} \hat\sigma^2_{\epsilon} &= \frac{1}{n-p} \sum\limits_{i=1}^n \hat\epsilon_i^2\\ &=\frac{1}{n-p} (Y-X\hat\beta)'\ (Y- X\hat\beta) \end{align*}\] where \(\hat\epsilon_i = y_i - x_i'\hat\beta\) are the residuals.
Simulation
Let’s simulate the following linear model: \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i\] where \(i = 1,2,...,n\), \(\epsilon_i \sim N(0, \sigma^2)\), and \(\sigma^2 = 16\).
First, create the vector of \(Y\). The parameters are set as \(\beta =(10, 2, 0.5)'\); \(n = 1000\); \(\sigma = 4\).
set.seed(001)
n = 1000; p = 2; sigma = 4
# Betas
b0 = 10
b1 = 2
b2 = 0.5
beta = c(b0, b1, b2)
# Xs (explanatory variables)
x1 = runif(n, 0, 10)
x2 = rnorm(n, mean = 4, sd = 10)
# design matrix with an intercept column
xmat = cbind(rep(1, n), x1, x2)
# errors
eps = rnorm(n, 0, sigma)
# calculate Y using the linear model
y = xmat %*% beta + eps
# bind data into a data frame
df_lm = as.data.frame(cbind(y, xmat))
colnames(df_lm) = c("y", "intercept", "x1", "x2")
head(df_lm)
## y intercept x1 x2
## 1 21.09686 1 2.655087 4.773031
## 2 14.25688 1 3.721239 1.031314
## 3 21.11518 1 5.728534 -7.832422
## 4 26.45658 1 9.082078 4.112927
## 5 23.14745 1 2.016819 13.916010
## 6 37.20973 1 8.983897 19.939675
Visualize data points
Let’s visualize the data.
library(plotly)
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d",
        mode = "markers",
        marker = list(color = pcolor,
                      opacity = 0.4))
Estimating \(\beta\) and \(\sigma^2_{\epsilon}\)
The \(\beta\)s can be estimated in two ways: using the closed-form formula and using a built-in R function.
# Using the formula (X'X)^{-1} X'Y
beta_form = qr.solve(crossprod(xmat)) %*% crossprod(xmat, y)
# Using an R function
beta_r = lm.fit(x = xmat, y = y)$coefficients
compare = cbind(beta_form, beta_r, beta)
colnames(compare) = c("formula", "r_function", "actual")
compare
## formula r_function actual
## 9.8791689 9.8791689 10.0
## x1 2.0064254 2.0064254 2.0
## x2 0.5127404 0.5127404 0.5
The calculated (estimated) values for the \(\beta\)s are close to the ones used to generate the model, that is 10, 2, and 0.5 for \(\beta_0\), \(\beta_1\), and \(\beta_2\) respectively.
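As an extra cross-check (a minimal sketch, not part of the workflow above), the same estimates can also be obtained through R's formula interface with lm(); it adds its own intercept, so only x1 and x2 appear in the formula:
# hypothetical cross-check using the formula interface on df_lm built above
fit_lm = lm(y ~ x1 + x2, data = df_lm)
coef(fit_lm)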
The realization of \(\hat\sigma^2_{\epsilon}\) is:
# estimate of the error variance sigma^2, computed from the residuals
sigma2_hat = 1/(n - p) * sum((y - xmat %*% beta_form)^2)
sigma2_hat
## [1] 17.76159
The true \(\sigma^2\) is 16 and the realized estimate is 17.7615874.
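For comparison (again a sketch, reusing the hypothetical fit_lm object from the cross-check above), the residual variance reported by lm can be extracted as well; note that summary.lm divides the residual sum of squares by the residual degrees of freedom \(n - p - 1 = 997\), so it will differ very slightly from the manual \(n - p\) version:
# residual variance as reported by summary.lm (divides by n - p - 1)
summary(fit_lm)$sigma^2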
Visualize fitted plane
So what does the fitted plane look like on the data points? Let’s visualize it.
library(reshape2)
# predict over a sensible grid of values
x1s <- seq(min(x1), max(x1), length.out = 100)
x2s <- seq(min(x2), max(x2), length.out = 100)
surface <- expand.grid(x1 = x1s, x2 = x2s, KEEP.OUT.ATTRS = FALSE)
# predict y at each grid point using the estimated coefficients
surface$y <- apply(surface[, c("x1", "x2")], 1, function(x) cbind(1, x[1], x[2]) %*% beta_form)
# reshape predictions into a matrix (rows = x2, columns = x1) for the surface plot
surface <- acast(surface, x2 ~ x1, value.var = "y")
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d",
        mode = "markers",
        marker = list(color = pcolor,
                      opacity = 0.7)) %>%
  add_trace(z = surface,
            x = x1s,
            y = x2s,
            type = "surface",
            opacity = 0.6)
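The surface added to the plot is simply the estimated regression plane, i.e. (using the beta_form estimates from above, rounded to three decimals): \[\hat y = 9.879 + 2.006\, x_1 + 0.513\, x_2\]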
We can see that the plane fitted from \(\hat\beta\) follows the data points pretty closely. However, note that linear models are very rigid: the fitted surface is always a plane.