In general, linear regression is a linear approach to modelling the relationship between a numerical response (dependent) variable and one or more explanatory (independent) variables.

# Theories

## Model Assumptions

Let:

- \(y_i \in \mathbb{R}\) be the measured response for the \(i\)th subject.

- \(x_i = (1, x_{i1}, ..., x_{ip}) \in \mathbb{R}^{p+1}\), \(i = 1, 2, ..., n\), be the values of the explanatory variables for the \(i\)th subject. The first entry is used for the intercept term.

Using *’* as the transpose operator, the linear regression model assumes that:

\[\begin{align*} Y_{i} &= x_i'\beta + \epsilon_i,\ i = 1, ..., n\\ Y &= X \beta + \epsilon \end{align*}\]

where:

- \(\beta = (\beta_0, \beta_1, ..., \beta_p)' \in \mathbb{R}^{p+1}\) is the **unknown** regression coefficient vector.

- The design matrix \(X \in \mathbb{R}^{n \times (p+1)}\) has \(x_i'\) as its \(i\)th row.

- \(Y = (Y_1, ..., Y_n)'\) is a random vector.
- \(\epsilon = (\epsilon_1, \epsilon_2, ..., \epsilon_n)'\), with \(\epsilon_i \stackrel{i.i.d.}{\sim} (0, \sigma^2)\), is a vector of i.i.d. error terms from an unknown distribution with mean 0 and unknown, fixed variance \(\sigma^2\).
- \(cov(\epsilon_i, \epsilon_i) = var (\epsilon_i) = \sigma^2\)
- Due to i.i.d. assumption, \(cov(\epsilon_i,\ \epsilon_j) = 0,\ i \neq j\)
- Thus, the diagonal elements of \(var(\epsilon)\) are equal to \(\sigma^2\) and the off-diagonal elements are zero. \[var(\epsilon) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}\]
- We can write \(var(\epsilon) = \sigma^2 I_n\), where \(I_n\) is a \(n \times n\) identity matrix, with ones on its diagonal and zero everywhere else.

Note that no assumptions are imposed on the design matrix \(X\): its values can be chosen by the experimenter, or they can be random draws from any distribution.
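To make the error covariance structure concrete, here is a small illustrative sketch (not part of the derivation) that builds \(var(\epsilon) = \sigma^2 I_n\) in R for \(n = 3\) and \(\sigma^2 = 4\):

```
# var(epsilon) = sigma^2 * I_n: common variance on the diagonal, zero covariance elsewhere
sigma2 = 4
diag(sigma2, nrow = 3)
```

Each diagonal entry is the common variance \(\sigma^2 = 4\); the zero off-diagonal entries reflect the independence of the \(\epsilon_i\).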

## Ordinary Least Squares (OLS) Estimator of \(\beta\)

Assuming that \((X'X)^{-1}\) exists, that is, \(X'X\) is invertible, the ordinary least squares estimator of \(\beta\) is defined by minimizing the error terms \(\epsilon = Y - Xb\). Specifically, we minimize the sum of the squared error terms:

\[\hat\beta = \underset{b\ \in\ \mathbb{R}^{p+1}}{\mathrm{argmin}}\ (Y-Xb)'\ (Y- Xb)\]

It can be shown that the solution is \[\hat \beta = (X'X)^{-1} X'Y\] given that \((X'X)^{-1}\) exists.
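This closed form follows from the normal equations; a brief derivation, expanding the objective and setting its gradient with respect to \(b\) to zero:

\[\begin{align*} (Y-Xb)'(Y-Xb) &= Y'Y - 2b'X'Y + b'X'Xb\\ \frac{\partial}{\partial b}\left[(Y-Xb)'(Y-Xb)\right] &= -2X'Y + 2X'Xb = 0\\ X'X\hat\beta &= X'Y\\ \hat\beta &= (X'X)^{-1}X'Y \end{align*}\]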

## Estimator of \(\sigma^2\)

An unbiased estimator of \(\sigma^2\) is \[\begin{align*} \hat\sigma^2_{\epsilon} &= \frac{1}{n-p-1} \sum\limits_{i=1}^n \hat\epsilon_i^2\\ &=\frac{1}{n-p-1} (Y-X\hat\beta)'\ (Y- X\hat\beta) \end{align*}\] where \(\hat\epsilon_i = y_i - x_i'\hat\beta\) is the \(i\)th residual and the divisor \(n - p - 1\) accounts for the \(p + 1\) estimated coefficients.
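A sketch of why this divisor yields unbiasedness: with the hat matrix \(H = X(X'X)^{-1}X'\), the residual vector satisfies \(Y - X\hat\beta = (I_n - H)\epsilon\), and

\[\begin{align*} E\left[(Y-X\hat\beta)'(Y-X\hat\beta)\right] &= E\left[\epsilon'(I_n - H)\epsilon\right]\\ &= \sigma^2\, tr(I_n - H)\\ &= \sigma^2\,(n - p - 1) \end{align*}\]

since \(tr(H)\) equals the number of columns of \(X\), which is \(p + 1\).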

# Simulation

Let’s simulate the following linear model: \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i\] where \(i = 1,2,...,1000\), \(\epsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)\), and \(\sigma^2 = 16\)

First, create the vector \(Y\). The parameters are set as \(\beta = (10, 2, 0.5)'\); \(n = 1000\); \(\sigma = 4\)

```
set.seed(001)
n = 1000; p = 2; sigma = 4  # sigma is the error standard deviation, so sigma^2 = 16
# Betas
b0 = 10
b1 = 2
b2 = 0.5
beta = c(b0, b1,b2)
# Xs
x1 = runif(n, 0, 10)
x2 = rnorm(n, mean = 4, sd = 10)
xmat = cbind(rep(1,n), x1, x2)
# errors
eps = rnorm(n,0,sigma)
# calculate Y using linear model
y = xmat %*% beta + eps
#bind data
df_lm = as.data.frame(cbind(y,xmat))
colnames(df_lm) = c("y", "intercept", "x1", "x2")
head(df_lm)
```

```
##          y intercept       x1        x2
## 1 21.09686         1 2.655087  4.773031
## 2 14.25688         1 3.721239  1.031314
## 3 21.11518         1 5.728534 -7.832422
## 4 26.45658         1 9.082078  4.112927
## 5 23.14745         1 2.016819 13.916010
```

## Visualize data points

Let’s visualize the data.

```
library(plotly)
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d",
        mode = "markers",
        marker = list(color = pcolor,
                      opacity = 0.4))
```

## Estimating \(\beta\) and \(\epsilon\)

The \(\beta\)s can be estimated in two ways: directly with the closed-form formula, or with R's built-in fitting function.

```
# Using the closed-form formula
beta_form = qr.solve(crossprod(xmat)) %*% crossprod(xmat, y)
# Using the R function lm.fit
beta_r = lm.fit(x = xmat, y = y)$coefficients
compare = cbind(beta_form, beta_r, beta)
colnames(compare)= c("formula", "r_function", "actual")
compare
```

```
##      formula r_function actual
##    9.8791689  9.8791689   10.0
## x1 2.0064254  2.0064254    2.0
## x2 0.5127404  0.5127404    0.5
```

The calculated (estimated) values of the \(\beta\)s are close to the ones used to generate the data, namely `10, 2, and 0.5` for \(\beta_0\), \(\beta_1\), and \(\beta_2\) respectively.
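Equivalently, the same coefficients can be obtained with the higher-level `lm()` function (a sketch using the simulated `df_lm`; `lm()` adds the intercept automatically):

```
# Fit the same model via the formula interface
fit = lm(y ~ x1 + x2, data = df_lm)
coef(fit)
```

Beyond the coefficients, `summary(fit)` also reports standard errors, t-statistics, and \(R^2\).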

The realization of \(\hat\sigma^2_{\epsilon}\) is:

```
sigma2_hat = 1/(n - p - 1)*sum((y - xmat %*% beta_form)^2)  # n - p - 1 = 997 degrees of freedom
sigma2_hat
```

`## [1] 17.7794`

The true \(\sigma^2\) is 16 and the realized value is 17.7794024, so a single sample gives an estimate reasonably close to the truth.
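As a quick sanity check (a sketch, reusing `xmat`, `beta`, `n`, `p`, and `sigma` from above), averaging the variance estimates over many freshly simulated datasets should approach \(\sigma^2 = 16\), since the estimator is unbiased:

```
# Monte Carlo check: regenerate errors, refit, re-estimate sigma^2
sim_sigma2 = replicate(200, {
  y_new = xmat %*% beta + rnorm(n, 0, sigma)
  b_hat = qr.solve(crossprod(xmat)) %*% crossprod(xmat, y_new)
  sum((y_new - xmat %*% b_hat)^2) / (n - p - 1)
})
mean(sim_sigma2)  # should be close to 16
```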

## Visualize fitted plane

So what does the fitted plane look like on the data points? Let’s visualize it.

```
library(reshape2)
# predict over a sensible grid of values
x1s <- seq(min(x1), max(x1), length.out = 100)
x2s <- seq(min(x2), max(x2), length.out = 100)
surface <- expand.grid(x1 = x1s, x2 = x2s, KEEP.OUT.ATTRS = FALSE)
# make predictions from the estimated coefficients
surface$y <- apply(surface[, c("x1", "x2")], 1,
                   function(x) cbind(1, x[1], x[2]) %*% beta_form)
surface <- acast(surface, x2 ~ x1, value.var = "y")
pcolor = rep("purple", length(x1))
plot_ly(df_lm,
        x = ~x1,
        y = ~x2,
        z = ~y,
        type = "scatter3d",
        mode = "markers",
        marker = list(color = pcolor,
                      opacity = 0.7)) %>%
  add_trace(z = surface,
            x = x1s,
            y = x2s,
            type = "surface",
            opacity = 0.6)
```

We can see that the plane fitted from \(\hat\beta\) follows the data points closely. However, note that linear models are rigid: the fitted surface is always a flat plane.