# For reproducibility
set.seed(1337)
# Generate some random data
x = rnorm(10000,3,2)
y = rnorm(10000,1,4)
Intro
Simple Linear Regression (SLR) has been tickled to death. One interesting tidbit about SLR is the set of different Sum of Squares formulations that exist and how they tie into just about everything. This post tries to deconstruct the Sum of Squares formulations into alternative equations.
Definitions
In the least technical terms possible….
Sum of Squares provides a measurement of the total variability of a data set by squaring each point and then summing them.
\[ \sum\limits_{i = 1}^n {x_i^2} \]
More often, we use the Corrected Sum of Squares, which compares each data point to the mean of the data set to obtain a deviation, squares that deviation, and then sums the results.
\[ \sum\limits_{i = 1}^n { { {\left( { {x_i} - \bar x} \right)}^2} } \]
where the mean is defined as: \[ \bar x = \frac{1}{n}\sum\limits_{i = 1}^n { {x_i} } \]
When we talk about Sum of Squares, it will always be the latter definition. Why? Well, the initial definition squares the raw values, so its intermediate results grow much faster and are more prone to overflow or loss of precision when working with large numbers (e.g. 1000000000000^2 vs. (1000000000 - 1000000)^2).
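To make the two definitions concrete, here is a minimal R sketch on a made-up three-point vector. The names `v`, `big`, `raw.ss`, and `corrected.ss` are illustrative and are not part of the functions defined later in the post. It also shows how much larger the values in the first definition become once the data sit far from zero.

```r
# Tiny made-up data set (illustrative only)
v = c(2, 4, 6)

# First definition: square each point, then sum
raw.ss = sum(v^2)                     # 4 + 16 + 36 = 56

# Corrected definition: square each deviation from the mean, then sum
corrected.ss = sum((v - mean(v))^2)   # (-2)^2 + 0^2 + 2^2 = 8

# Shift the same data far from zero: the deviations are unchanged,
# but the raw squares become enormous
big = v + 1e9
big.raw.ss = sum(big^2)                       # on the order of 3e18
big.corrected.ss = sum((big - mean(big))^2)   # still 8
```

The corrected form only ever squares the deviations, which stay small no matter where the data are centered.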
Arrangements
There are three key equations:
- Sum of Squares over \(x\): \[ {S_{xx} } = \sum\limits_{i = 1}^n { { {\left( { {x_i} - \bar x} \right)}^2} } \]
- Sum of Squares over \(y\): \[ {S_{yy} } = \sum\limits_{i = 1}^n { { {\left( { {y_i} - \bar y} \right)}^2} } \]
- Sum of \(x\) times \(y\): \[ {S_{xy} } = \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {y_i} - \bar y} \right)} \]
Psst… The last one isn’t a square! In fact, it’s the numerator of what’s called the sample covariance. It’s listed here because of the similarities in manipulations that you will see later on.
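The link to covariance can be checked directly: dividing \(S_{xy}\) by \(n - 1\) gives the sample covariance that R’s `cov()` reports, and dividing \(S_{xx}\) by \(n - 1\) gives `var()`. A small sketch, using made-up vectors `a` and `b` rather than the post’s data:

```r
# Made-up paired data (illustrative only)
a = c(1, 2, 4, 7)
b = c(3, 5, 4, 10)
n = length(a)

# Sxy and Sxx straight from their definitions
sxy = sum((a - mean(a)) * (b - mean(b)))
sxx = sum((a - mean(a))^2)

# Sample covariance and variance use an n - 1 denominator
cov.matches = isTRUE(all.equal(sxy / (n - 1), cov(a, b)))
var.matches = isTRUE(all.equal(sxx / (n - 1), var(a)))
```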
These initial arrangements can be modified to take on different forms such as:
\[ \begin{align*} {S_{xx} } &= \sum\limits_{i = 1}^n { { {\left( { {x_i} - \bar x} \right)}^2} } \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - n{ {\bar x}^2} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){x_i} } \notag \\ \end{align*} \]
and
\[ \begin{align*} {S_{xy} } &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {y_i} - \bar y} \right)} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\bar y \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){y_i} } \notag \\ &= \sum\limits_{i = 1}^n {\left( { {y_i} - \bar y} \right){x_i} } \notag \\ \end{align*} \]
The next two sections go into depth on how to manipulate these equations. The main tools for these manipulations are the definition of the mean and a few properties of summations.
Providing different forms of the Sum of Squares for \(S_{xx}\) and \(S_{yy}\)
These arrangements can be modified rather nicely to alternative expressions.
For instance, the first two equations (\(S_{xx}\) and \(S_{yy}\)) can both be rewritten as follows:
\[ \begin{align} {S_{xx} } &= \sum\limits_{i = 1}^n { { {\left( { {x_i} - \bar x} \right)}^2} } & \text{Definition} \notag \\ &= \sum\limits_{i = 1}^n {\left( {x_i^2 - 2{x_i}\bar x + { {\bar x}^2} } \right)} & \text{Expand the square} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - 2\bar x\sum\limits_{i = 1}^n { {x_i} } + { {\bar x}^2}\sum\limits_{i = 1}^n 1 & \text{Split Summation} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - 2\bar x\sum\limits_{i = 1}^n { {x_i} } + \underbrace {n{ {\bar x}^2} }_{\sum\limits_{i = 1}^n c = n \cdot c} & \text{Sum the constant} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - 2\bar x\left[ {n \cdot \frac{1}{n} } \right]\sum\limits_{i = 1}^n { {x_i} } + n{ {\bar x}^2} & \text{Multiply by 1} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - 2\bar xn \cdot \underbrace {\left[ {\frac{1}{n}\sum\limits_{i = 1}^n { {x_i} } } \right]}_{ = \bar x} + n{ {\bar x}^2}& \text{Group terms for mean} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - 2\bar xn\bar x + n{ {\bar x}^2} & \text{Substitute the mean} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - 2n{ {\bar x}^2} + n{ {\bar x}^2} & \text{Rearrange terms} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - n{ {\bar x}^2} & \text{Simplify} \\ \end{align} \]
We’ll call this result the alternative definition.
We can further manipulate this expression…
\[ \begin{align} {S_{xx} } &= \sum\limits_{i = 1}^n { { {\left( { {x_i} - \bar x} \right)}^2} } & \text{Definition} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - n{ {\bar x}^2} & \text{Previous result} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - n{\left( {\frac{1}{n}\sum\limits_{i = 1}^n { {x_i} } } \right)^2} & \text{Substitute mean} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - n \cdot \frac{1}{ { {n^2} } } \cdot \sum\limits_{i = 1}^n { {x_i} } \cdot \sum\limits_{i = 1}^n { {x_i} } & \text{Square terms} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - \frac{1}{n}\sum\limits_{i = 1}^n { {x_i} } \cdot \sum\limits_{i = 1}^n { {x_i} } & \text{Simplify} \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - \bar x\sum\limits_{i = 1}^n { {x_i} } & \text{Substitute mean} \notag \\ &= \sum\limits_{i = 1}^n {\left( {x_i^2 - \bar x{x_i} } \right)} & \text{One summation} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} \cdot {x_i} - \bar x \cdot {x_i} } \right)} & \text{Expand} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){x_i} } & \text{Factor} \\ \end{align} \]
The last result we’ll refer to as the exterior definition.
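The exterior definition can also be reached more directly. The sketch below is a shortcut rather than the route taken above; it relies only on the fact that deviations from the mean sum to zero:

\[ \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)} = \sum\limits_{i = 1}^n { {x_i} } - n\bar x = n\bar x - n\bar x = 0 \]

Splitting off one factor of the square then gives:

\[ \begin{align*} {S_{xx} } &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {x_i} - \bar x} \right)} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){x_i} } - \bar x\sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){x_i} } \notag \\ \end{align*} \]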
Therefore, as stated previously, we have:
\[ \begin{align*} {S_{xx} } &= \sum\limits_{i = 1}^n { { {\left( { {x_i} - \bar x} \right)}^2} } \notag \\ &= \sum\limits_{i = 1}^n {x_i^2} - n{ {\bar x}^2} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){x_i} } \notag \\ \end{align*} \]
Psst… For \(S_{yy}\), simply replace every \(x\) you see above with a \(y\).
e.g.
\[ \begin{align*} {S_{yy} } &= \sum\limits_{i = 1}^n { { {\left( { {y_i} - \bar y} \right)}^2} } \notag \\ &= \sum\limits_{i = 1}^n {y_i^2} - n{ {\bar y}^2} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {y_i} - \bar y} \right){y_i} } \notag \\ \end{align*} \]
Exploring the different forms of \(S_{xy}\)
Based on the previous section, what comes next should not be very surprising. The only real differences between these two sections are the inclusion of a second variable AND the requirement that \(x\) and \(y\) have the same number of observations (i.e. \(n_{x} = n_{y} = n\)).
\[ \begin{align} {S_{xy} } &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {y_i} - \bar y} \right)} & \text{Definition} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i}{y_i} - {x_i}\bar y - \bar x{y_i} + \bar x\bar y} \right)} & \text{Expand the product} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar y\sum\limits_{i = 1}^n { {x_i} } - \bar x\sum\limits_{i = 1}^n { {y_i} } + \bar x\bar y\sum\limits_{i = 1}^n 1 & \text{Split Summation} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar y\sum\limits_{i = 1}^n { {x_i} } - \bar x\sum\limits_{i = 1}^n { {y_i} } + n\bar x\bar y & \text{Simplify} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar y \cdot \left[ {n \cdot \frac{1}{n} } \right] \cdot \sum\limits_{i = 1}^n { {x_i} } - \bar x \cdot \left[ {n \cdot \frac{1}{n} } \right] \cdot \sum\limits_{i = 1}^n { {y_i} } + n\bar x\bar y & \text{Multiply by 1} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar yn\left[ {\frac{1}{n} \cdot \sum\limits_{i = 1}^n { {x_i} } } \right] - \bar xn\left[ {\frac{1}{n} \cdot \sum\limits_{i = 1}^n { {y_i} } } \right] + n\bar x\bar y & \text{Identify Means} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar yn\bar x - \bar xn\bar y + n\bar x\bar y & \text{Substitute Means} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\bar y - n\bar x\bar y + n\bar x\bar y & \text{Rearrange Terms} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\bar y & \text{Simplify} \\ \end{align} \]
The result is a modified version of the alternative definition.
We can obtain a form similar to the one in the previous section, except this time we must choose to have either \(y_i\) or \(x_i\) on the exterior… Let’s start by opting for \(y_i\) on the exterior:
\[ \begin{align} {S_{xy} } &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {y_i} - \bar y} \right)} & \text{Definition} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\bar y & \text{Previous Result} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\left[ {\frac{1}{n}\sum\limits_{i = 1}^n { {y_i} } } \right] & \text{Substitute Mean} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar x\sum\limits_{i = 1}^n { {y_i} } & \text{Simplify} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} \cdot {y_i} - \bar x \cdot {y_i} } \right)} & \text{One summation} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){y_i} } & \text{Factor} \notag \\ \end{align} \]
Alternatively, we can go the opposite route and have \(x_i\) on the exterior:
\[ \begin{align} {S_{xy} } &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {y_i} - \bar y} \right)} & \text{Definition} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\bar y & \text{Previous Result} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\left[ {\frac{1}{n}\sum\limits_{i = 1}^n { {x_i} } } \right]\bar y & \text{Substitute Mean} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - \bar y\sum\limits_{i = 1}^n { {x_i} } & \text{Simplify} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} \cdot {y_i} - {x_i} \cdot \bar y} \right)} & \text{One Summation} \notag \\ &= \sum\limits_{i = 1}^n {\left( { {y_i} - \bar y} \right){x_i} } & \text{Factor} \notag \\ \end{align} \]
Both results are versions of the exterior definition.
Therefore, we have the following equations:
\[ \begin{align*} {S_{xy} } &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right)\left( { {y_i} - \bar y} \right)} \notag \\ &= \sum\limits_{i = 1}^n { {x_i}{y_i} } - n\bar x\bar y \notag \\ &= \sum\limits_{i = 1}^n {\left( { {x_i} - \bar x} \right){y_i} } \notag \\ &= \sum\limits_{i = 1}^n {\left( { {y_i} - \bar y} \right){x_i} } \notag \\ \end{align*} \]
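These quantities are what tie back to Simple Linear Regression: the least-squares slope is \(S_{xy}/S_{xx}\), the intercept is \(\bar y\) minus the slope times \(\bar x\), and the sample correlation is \(S_{xy}/\sqrt{S_{xx}S_{yy}}\). A hedged sketch comparing the hand-computed values against `lm()` and `cor()`, using made-up vectors `a` and `b` rather than the post’s data:

```r
# Made-up paired data (illustrative only)
a = c(1, 2, 4, 7, 9)
b = c(3, 5, 4, 10, 12)

# The three sums, straight from their definitions
sxx = sum((a - mean(a))^2)
syy = sum((b - mean(b))^2)
sxy = sum((a - mean(a)) * (b - mean(b)))

# Least-squares estimates written in terms of the sums
slope = sxy / sxx
intercept = mean(b) - slope * mean(a)

# Reference values from R's built-in fitting and correlation functions
fit = coef(lm(b ~ a))
slope.matches = isTRUE(all.equal(slope, unname(fit["a"])))
intercept.matches = isTRUE(all.equal(intercept, unname(fit["(Intercept)"])))
cor.matches = isTRUE(all.equal(sxy / sqrt(sxx * syy), cor(a, b)))
```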
A simple test
The manipulations above can be checked numerically. To do so, let’s quickly write a few R functions and compare their output.
Let’s formalize the definitions.
# Sxx and Syy definition
s.xx = function(x){
  sum((x-mean(x))^2)
}
# Sxx and Syy Alternative definition
s.xx.alt = function(x){
  n = length(x)
  sum(x^2) - n*mean(x)^2
}
# Sxx and Syy Exterior definition
s.xx.ext = function(x){
  sum((x-mean(x))*x)
}
# Sxy Definition
s.xy = function(x,y){
  sum((x-mean(x))*(y-mean(y)))
}
# Sxy Alternative Definition
s.xy.alt = function(x,y){
  n = length(x)
  sum(x*y) - n*mean(x)*mean(y)
}
# Sxy Exterior Definition
s.xy.ext = function(x,y){
  sum((x-mean(x))*y)
}
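The exterior function for \(S_{xy}\) only covers the form with \(y_i\) on the exterior. A possible companion for the \(x_i\)-exterior form is sketched below. The name `s.xy.ext.x` and the length check are additions for illustration and are not part of the original set of functions. The check is there because R recycles the shorter vector when lengths differ (silently when one length is a multiple of the other), which would hide a violation of \(n_{x} = n_{y}\).

```r
# Sxy Exterior Definition with x on the exterior (hypothetical helper)
s.xy.ext.x = function(x, y){
  # Both vectors must have the same number of observations
  stopifnot(length(x) == length(y))
  sum((y - mean(y)) * x)
}

# Made-up paired data (illustrative only)
a = c(1, 2, 4, 7)
b = c(3, 5, 4, 10)

# Compare against the definition of Sxy
by.definition = sum((a - mean(a)) * (b - mean(b)))
forms.agree = isTRUE(all.equal(s.xy.ext.x(a, b), by.definition))
```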
Now, let’s see the results of each function:
### Sxx and Syy
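# All give the same value for Sxx Definition?
# (all.equal() compares two values; a third positional argument is read as the tolerance)
isTRUE(all.equal(s.xx(x), s.xx.alt(x))) && isTRUE(all.equal(s.xx(x), s.xx.ext(x)))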
[1] TRUE
# What is the value?
s.xx.ext(x)
[1] 40066.65
### Sxy
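# All give the same value for Sxy Definition?
# (all.equal() compares two values; a third positional argument is read as the tolerance)
isTRUE(all.equal(s.xy(x,y), s.xy.alt(x,y))) && isTRUE(all.equal(s.xy(x,y), s.xy.ext(x,y)))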
[1] TRUE
# What is the value?
s.xy.ext(x,y)
[1] 330.3306
Timing
Aside from the derivations and the simple tests, there is one other item to consider… The amount of time it takes to calculate each equation.
# install.packages("microbenchmark")
# Load microbenchmark
library(microbenchmark)
# Benchmark Sxx definition against x data
microbenchmark(s.xx(x), s.xx.alt(x), s.xx.ext(x))
Warning in microbenchmark(s.xx(x), s.xx.alt(x), s.xx.ext(x)): less accurate
nanosecond times to avoid potential integer overflows
Unit: microseconds
expr min lq mean median uq max neval cld
s.xx(x) 37.884 41.205 51.41113 42.0660 44.9770 772.850 100 a
s.xx.alt(x) 37.720 38.499 52.41809 39.7085 43.3575 1083.917 100 a
s.xx.ext(x) 36.900 40.508 43.76094 41.0820 45.1000 91.020 100 a
# Benchmark Syy definition against y data
microbenchmark(s.xx(y), s.xx.alt(y), s.xx.ext(y))
Unit: microseconds
expr min lq mean median uq max neval cld
s.xx(y) 40.262 41.2255 44.86056 43.7675 46.7810 84.009 100 a
s.xx.alt(y) 37.802 38.4170 41.55309 39.6060 44.7925 56.539 100 b
s.xx.ext(y) 39.811 40.7540 44.00858 43.1525 45.6945 64.083 100 a
# Benchmark Sxy Definition
microbenchmark(s.xy(x,y), s.xy.alt(x,y), s.xy.ext(x,y))
Unit: microseconds
expr min lq mean median uq max neval cld
s.xy(x, y) 54.079 62.6275 79.77370 66.0715 70.2125 1298.880 100 a
s.xy.alt(x, y) 50.225 54.2840 69.53354 55.0630 60.0240 1268.868 100 ab
s.xy.ext(x, y) 35.875 40.0570 43.39194 40.7950 45.8175 61.500 100 b
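In this case, going by the median times, the alternative definition is fastest for \(S_{xx}\) and \(S_{yy}\), whereas for \(S_{xy}\) the exterior definition is fastest. Note that the gaps for \(S_{xx}\) and \(S_{yy}\) are only a few microseconds, and for the \(x\) data all three functions share the same letter in the cld column, which suggests they are not clearly distinguishable there.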
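Speed is not the only thing that separates the forms. The alternative definition subtracts two large, nearly equal numbers, so it can lose accuracy in floating-point arithmetic when the data sit far from zero relative to their spread. The sketch below uses a made-up vector `big` (not the post’s data) and re-declares the two functions so that it runs on its own. The exact value the alternative form returns depends on rounding, so none is quoted here.

```r
# Re-declared so the snippet is self-contained
s.xx = function(x){
  sum((x - mean(x))^2)
}
s.xx.alt = function(x){
  n = length(x)
  sum(x^2) - n*mean(x)^2
}

# Made-up data: tiny spread, huge offset (illustrative only)
big = 1e9 + c(1, 2, 3)

# Deviations are -1, 0, 1, so the exact answer is 2
by.definition = s.xx(big)

# Here sum(x^2) and n*mean(x)^2 are both around 3e18, where doubles
# are spaced far more than 2 apart, so the difference cannot come out as 2
by.alternative = s.xx.alt(big)
```

For data like the post’s, centered near zero with a moderate spread, all of the forms agree to within `all.equal()`’s default tolerance, as the simple test shows.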