Intro
Below are a few proofs regarding the least squares derivation associated with multiple linear regression (MLR). These proofs are useful for understanding where the MLR algorithm originates. In particular, if one aims to write their own implementation, these proofs provide a means to understand:
- What logic is being used?
- How does the logic apply in a procedural form?
- Why is this logic present?
Multiple Linear Regression (MLR) Definition
Formula
\[\begin{aligned} {y_i} &= {\beta _0} + {\beta _1}{x_{i,1} } + {\beta _2}{x_{i,2} } + \cdots + {\beta _{p - 1} }{x_{i,p - 1} } + {\varepsilon _i} \\ {Y_{n \times 1} } &= {X_{n \times p} }{\beta _{p \times 1} } + {\varepsilon _{n \times 1} } \end{aligned}\]
Responses:
\[\mathbf{y}={\left( {\begin{array}{*{20}{c} } { {y_1} } \\ \vdots \\ { {y_n} } \end{array} } \right)_{n \times 1} }\]
Errors:
\[\varepsilon={\left( {\begin{array}{*{20}{c} } { {\varepsilon _1} } \\ \vdots \\ { {\varepsilon _n} } \end{array} } \right)_{n \times 1} }\]
Design Matrix:
\[X = {\left( {\begin{array}{*{20}{c} } 1&{ {x_{1,1} } }& \cdots &{ {x_{1,p - 1} } } \\ \vdots & \vdots &{}& \vdots \\ 1&{ {x_{n,1} } }& \cdots &{ {x_{n,p - 1} } } \end{array} } \right)_{n \times p} }\]
Parameters:
\[\beta = {\left( {\begin{array}{*{20}{c} } { {\beta _0} } \\ { {\beta _1} } \\ \vdots \\ { {\beta _{p - 1} } } \end{array} } \right)_{p \times 1} }\]
Notes:
Symbols used above have the following meaning:
- n: number of observations
- p: number of parameters, i.e., the number of columns of the design matrix (the intercept plus \(p - 1\) variables)
- X: design matrix
- y: response vector
- \(\beta\): parameter or coefficient vector
- \(\varepsilon\): random error vector
Within the scalar parameterization given by \(y_i\), the predictors \(x_{i,j}\) only go up to column \(p - 1\) since one column of the design matrix is allocated to contain 1's for \(\beta_0\), the intercept.
Some references may instead define the design matrix with \(p + 1\) columns, which would mean there are \(p\) variables plus an intercept. In that convention, the rows of the design matrix would be structured as:
\[{\left[ {\begin{array}{*{20}{c} } 1&{ {x_{1,1} } }& \cdots &{ {x_{1,p} } } \end{array} } \right]_{1 \times \left( {p + 1} \right)} }\]
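To make the definitions concrete, below is a minimal NumPy sketch of assembling a design matrix with a leading column of 1's; the predictor matrix `x` and the dimensions are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 100, 4                      # n observations, p parameters (intercept + p - 1 variables)
x = rng.normal(size=(n, p - 1))    # hypothetical predictor values (no intercept column yet)

# Design matrix: prepend a column of 1's for the intercept beta_0
X = np.column_stack([np.ones(n), x])

print(X.shape)                     # (100, 4), i.e., n x p
```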
Refresher on Matrix Derivatives
Before beginning, it is helpful to review matrix differentiation. So, let's quickly go over two matrix differentiation rules that will be employed next.
Consider vectors \(\mathbf{a}_{p \times 1}\) and \(\mathbf{b}_{p \times 1}\); then the derivative of the inner product with respect to \(\mathbf{b}\) is given by:
\[\frac{ {\partial {\mathbf{a}^T}\mathbf{b} } }{ {\partial \mathbf{b} } } = \frac{ {\partial {\mathbf{b}^T}\mathbf{a} } }{ {\partial \mathbf{b} } } = \mathbf{a}\]
Now, consider the quadratic form \({\mathbf{b}^T}A\mathbf{b}\) with symmetric matrix \(A_{p \times p}\); then we have:
\[\frac{ {\partial {\mathbf{b}^T}A\mathbf{b} } }{ {\partial \mathbf{b} } } = 2A\mathbf{b}\]
Note: if \(A\) is not symmetric, then we can first rewrite the quadratic form in terms of the symmetric matrix \(\left( {A + {A^T} } \right)/2\):
\[ {\mathbf{b}^T}A\mathbf{b} = \mathbf{b}^{T}\left( {\left( {A + {A^T} } \right)/2} \right)\mathbf{b}\]
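These two rules are easy to sanity-check numerically. The sketch below (an illustration, not part of the derivation) compares each rule against a central-difference approximation of the gradient for randomly generated \(\mathbf{a}\), \(\mathbf{b}\), and a symmetric \(A\).

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
a = rng.normal(size=p)
b = rng.normal(size=p)
M = rng.normal(size=(p, p))
A = (M + M.T) / 2                  # symmetrize so the quadratic-form rule applies

def numerical_gradient(f, v, h=1e-6):
    """Central-difference approximation of the gradient of f at v."""
    grad = np.zeros_like(v)
    for j in range(len(v)):
        step = np.zeros_like(v)
        step[j] = h
        grad[j] = (f(v + step) - f(v - step)) / (2 * h)
    return grad

# Rule 1: d(a^T b)/db = a
assert np.allclose(numerical_gradient(lambda v: a @ v, b), a)

# Rule 2: d(b^T A b)/db = 2 A b for symmetric A
assert np.allclose(numerical_gradient(lambda v: v @ A @ v, b), 2 * A @ b)

print("Both matrix derivative rules check out numerically.")
```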
Least Squares with Multiple Linear Regression (MLR)
Goal: Minimize the residual sum of squares (RSS) with respect to \(\beta\).
\[\hat \beta = \mathop {\arg \min }\limits_\beta {\left\| {y - X\beta } \right\|^2}\]
Residuals:
\[\begin{aligned} \mathbf{e} &= \mathbf{y} - \mathbf{\hat{y} } \\ &= \mathbf{y} - X\hat{\beta} \\ \end{aligned} \]
RSS Definition:
\[\begin{aligned} RSS &= {\mathbf{e}^T}\mathbf{e} = \left[ {\begin{array}{*{20}{c} } { {e_1} }&{ {e_2} }& \cdots &{ {e_n} } \end{array} } \right]_{1 \times n}\left[ {\begin{array}{*{20}{c} } { {e_1} } \\ { {e_2} } \\ \vdots \\ { {e_n} } \end{array} } \right]_{n \times 1} \\ &= {\left[ { {e_1} \times {e_1} + {e_2} \times {e_2} + \cdots + {e_n} \times {e_n} } \right]_{1 \times 1} } = \sum\limits_{i = 1}^n {e_i^2} \end{aligned} \]
Note: \(\mathbf{e} \neq \varepsilon\); \(\mathbf{e}\) is the vector of observed residuals obtained from the fitted regression, whereas \(\varepsilon\) is the unobserved random error.
Expand RSS:
\[\begin{aligned} RSS &= {\left( {y - X\beta } \right)^T}\left( {y - X\beta } \right) \\ &= \left( { {y^T} - {\beta ^T}{X^T} } \right)\left( {y - X\beta } \right) \\ &= {y^T}y - {\beta ^T}{X^T}y - {y^T}X\beta + {\beta ^T}{X^T}X\beta \\ &= {y^T}y - {\left( { {\beta ^T}{X^T}y} \right)^T} - {y^T}X\beta + {\beta ^T}{X^T}X\beta \\ &= {y^T}y - {y^T}X\beta - {y^T}X\beta + {\beta ^T}{X^T}X\beta \\ &= {y^T}y - 2{\beta ^T}{X^T}y + {\beta ^T}{X^T}X\beta \\ \end{aligned}\]
Note:
\[\beta _{1 \times p}^TX_{p \times n}^T{y_{n \times 1} } = {\left( {\beta _{1 \times p}^TX_{p \times n}^T{y_{n \times 1} } } \right)^T} = y_{1 \times n}^T{X_{n \times p} }{\beta _{p \times 1} }\]
We are able to perform a transpose in place because the result is a \(1 \times 1\) scalar.
Take the derivative with respect to \(\beta\):
\[\begin{aligned} RSS &= {y^T}y - 2{\beta ^T}{X^T}y + {\beta ^T}{X^T}X\beta \\ \frac{ {\partial RSS} }{ {\partial \beta } } &= - 2{X^T}y + 2{X^T}X\beta \end{aligned}\]
Set equal to zero and solve:
\[\begin{aligned} 0 &= - 2{X^T}y + 2{X^T}X\beta \\ 2{X^T}X\beta &= 2{X^T}y \\ {X^T}X\beta &= {X^T}y \\ \hat \beta &= {\left( { {X^T}X} \right)^{ - 1} }{X^T}y \end{aligned}\]
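The last step assumes \(X^T X\) is invertible, i.e., \(X\) has full column rank. In code, a minimal sketch (on simulated data with hypothetical coefficient values) solves the normal equations \(X^T X \beta = X^T y\) directly rather than forming the explicit inverse, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])          # hypothetical true coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)    # y = X beta + epsilon

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer via the closed form (X^T X)^{-1} X^T y
assert np.allclose(beta_hat, np.linalg.inv(X.T @ X) @ X.T @ y)

print(beta_hat)                                      # close to beta_true
```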
Mean of LS Estimator for MLR
Next up, let’s take the mean of the estimator!
\[\begin{aligned} E\left( {\hat \beta } \right) &= E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}y} \right] \\ &= E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\left( {X\beta + \varepsilon } \right)} \right] \\ &= E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}X\beta + { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon } \right] \\ &= E\left[ {\beta + { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon } \right] \\ &= \beta + E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon } \right] \\ \end{aligned}\]
Notes:
- We substituted in the definition \(y = X\beta + \varepsilon\) and then simplified using \({\left( { {X^T}X} \right)^{ - 1} }{X^T}X = I\).
- \(\beta\) is a constant within the expectation and, thus, we pulled it out.
\[\begin{aligned} E\left( {\hat \beta } \right) &= \beta + E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon } \right] \\ &= \beta + E\left[ {E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon |X} \right]} \right] \\ &= \beta + E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\underbrace {E\left[ {\varepsilon |X} \right]}_{ = 0{\text{ by model} } } } \right] \\ &= \beta \end{aligned}\]
Notes:
- Used the law of total expectation \(E\left[ X \right] = E\left[ {E\left[ {X|Y} \right]} \right]\).
- Showed that the estimator is unbiased under the exogeneity assumption \(E\left[ {\varepsilon |X} \right] = 0\), i.e., the errors have conditional mean zero.
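The unbiasedness result can also be illustrated (not proven) with a small Monte Carlo sketch: holding a hypothetical design matrix fixed and drawing mean-zero errors repeatedly, the average of the estimates should land close to the true \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_sims = 50, 3, 5_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # fixed design matrix
beta_true = np.array([1.0, -2.0, 0.5])                           # hypothetical true coefficients

estimates = np.empty((n_sims, p))
for s in range(n_sims):
    eps = rng.normal(scale=1.0, size=n)                          # E[eps | X] = 0
    y = X @ beta_true + eps
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))                                    # approximately beta_true
```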
Covariance of the LS Estimator for MLR
To perform inference, we’ll need to know the covariance matrix of \(\hat{\beta}\).
\[\begin{aligned} \operatorname{cov} \left( {\hat \beta } \right) &= E\left[ {\left( {\hat \beta - \beta } \right){ {\left( {\hat \beta - \beta } \right)}^T} } \right] \\ &= E\left[ {\left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}y - \beta } \right){ {\left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}y - \beta } \right)}^T} } \right] \\ &= E\left[ \begin{gathered} \left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\left( {X\beta + \varepsilon } \right) - \beta } \right) \\ {\left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\left( {X\beta + \varepsilon } \right) - \beta } \right)^T} \\ \end{gathered} \right] \\ &= E\left[ \begin{gathered} \left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}X\beta + { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon - \beta } \right) \\ {\left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}X\beta + { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon - \beta } \right)^T} \\ \end{gathered} \right] \\ &= E\left[ {\left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon } \right){ {\left( { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon } \right)}^T} } \right] \\ &= E\left[ { { {\left( { {X^T}X} \right)}^{ - 1} }{X^T}\varepsilon {\varepsilon ^T}{X} { {\left( { {X^T}X} \right)}^{ - 1} } } \right] \\ &= {\left( { {X^T}X} \right)^{ - 1} }{X^T}E\left[ {\varepsilon {\varepsilon ^T} } \right]X{\left( { {X^T}X} \right)^{ - 1} } \\ &= {\left( { {X^T}X} \right)^{ - 1} }{X^T}\operatorname{var} \left( \varepsilon \right)X{\left( { {X^T}X} \right)^{ - 1} } \\ \end{aligned}\]
Note: In the step that pulls \(X\) outside of the expectation, the design matrix is treated as fixed (equivalently, the calculation is done conditionally on \(X\)). The above calculations are useful in other regression paradigms with minimal modification.
\[\begin{aligned} Cov\left( {\hat \beta } \right) &= {\left( { {X^T}X} \right)^{ - 1} }{X^T}\operatorname{var} \left( \varepsilon \right)X{\left( { {X^T}X} \right)^{ - 1} } \\ &= {\left( { {X^T}X} \right)^{ - 1} }{X^T}\left( { {\sigma ^2}{I_n} } \right)X{\left( { {X^T}X} \right)^{ - 1} } \\ &= {\sigma ^2}{\left( { {X^T}X} \right)^{ - 1} }{X^T}\left( I_n \right)X{\left( { {X^T}X} \right)^{ - 1} } \\ &= {\sigma ^2}{\left( { {X^T}X} \right)^{ - 1} }{X^T}X{\left( { {X^T}X} \right)^{ - 1} } \\ &= {\sigma ^2}{\left( { {X^T}X} \right)^{ - 1} } \\ \end{aligned}\]
Note: Under the homoscedasticity assumption, the variance of the error term is constant, i.e., we assume that \(\operatorname{var} \left( \varepsilon \right) = {\sigma ^2}{I_n}\).
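Continuing the Monte Carlo sketch from the previous section (which used homoscedastic errors with \(\sigma = 1\)), the empirical covariance of the simulated \(\hat\beta\) values should be close to \(\sigma^2 {\left( {X^T X} \right)^{-1}}\):

```python
# Continuing from the simulation above (reuses `estimates` and `X`; sigma^2 = 1)
sigma2 = 1.0

theoretical_cov = sigma2 * np.linalg.inv(X.T @ X)
empirical_cov = np.cov(estimates, rowvar=False)

print(np.max(np.abs(theoretical_cov - empirical_cov)))  # small, up to sampling error
```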
Fin
Based on the above work, we have the following results…
Solutions:
\[\begin{aligned} {\hat \beta _{p \times 1} } &= \left( { {X^T}X} \right)_{p \times p}^{ - 1}X_{p \times n}^T{y_{n \times 1} } \\ E\left( {\hat \beta } \right) &= \beta_{p \times 1} \\ Cov\left( {\hat \beta } \right) &= {\sigma ^2}{\left( { {X^T}X} \right)_{p \times p}^{ - 1} } \end{aligned}\]
Freebies:
\[\begin{aligned} df &= n-p \\ {\hat \sigma ^2} &= \frac{ { {\mathbf{e}^T}\mathbf{e} } }{ {df} } = \frac{RSS}{ {n - p} } \end{aligned}\]
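Putting the results together, here is a minimal end-to-end sketch (again on hypothetical simulated data) that computes \(\hat\beta\), the error variance estimate \(\hat\sigma^2 = RSS/(n - p)\), and the standard errors from the diagonal of \(\hat\sigma^2 {\left( {X^T X} \right)^{-1}}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])          # hypothetical true coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residuals, degrees of freedom, and the error variance estimate
e = y - X @ beta_hat
df = n - p
sigma2_hat = (e @ e) / df                            # RSS / (n - p)

# Cov(beta_hat) = sigma^2 (X^T X)^{-1}, estimated with sigma2_hat
cov_beta_hat = sigma2_hat * XtX_inv
std_errors = np.sqrt(np.diag(cov_beta_hat))

print(beta_hat, sigma2_hat, std_errors)
```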