# Simple Linear Regression

Daniel Weibel
Created 27 Feb 2016
• Given: $n$ observed data points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$
• Goal: predict $Y$ for any value of $X$
• $X$ = predictor, $Y$ = response
• “Predict a quantitative response $Y$ based on a predictor variable $X$”
• “Regress $Y$ onto $X$”
• Approach: find a model function $Y \approx \beta_0 + \beta_1 X$ that fits the $n$ data points as well as possible
• Find appropriate values for $\beta_0$ (intercept) and $\beta_1$ (slope)
• $\beta_0$ and $\beta_1$ are the coefficients of the model
• Method of least squares: assess how well a model function fits the data points
• Residual: $e_i = y_i - \hat{y}_i$, for each of the $n$ data points (assuming some values for $\beta_0$ and $\beta_1$)
• That is, the difference between the observed $Y$-value at $x_i$ and the $Y$-value predicted at $x_i$ by the chosen values for $\beta_0$ and $\beta_1$
• RSS (residual sum of squares): $\text{RSS} = e_1^2 + e_2^2 + \dots + e_n^2$
• Choose $\beta_0$ and $\beta_1$ to minimise RSS:
• $\beta_0 = \overline{y} - \beta_1 \overline{x}$
• $\beta_1 = \frac{\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}}{\sum_{i=1}^{n}{(x_i - \overline{x})^2}}$
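A minimal sketch of the closed-form fit in Python (NumPy), assuming made-up toy data; the values of `x` and `y` are purely for illustration:

```python
import numpy as np

# Toy data (made-up values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def rss(b0, b1, x, y):
    """Residual sum of squares for candidate coefficients b0, b1."""
    e = y - (b0 + b1 * x)        # residuals e_i = y_i - yhat_i
    return np.sum(e ** 2)

# Closed-form least-squares estimates (the formulas above)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)             # fitted intercept and slope
print(rss(b0, b1, x, y))  # the minimised RSS
print(b0 + b1 * 6.0)      # predicted Y at X = 6
```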
• Standard errors of $\beta_0$ and $\beta_1$: assess how close our chosen $\beta_0$ and $\beta_1$ are to the true coefficients $\beta_0^t$ and $\beta_1^t$ of the true model function (population) $Y = \beta_0^t + \beta_1^t X + \epsilon$
• $\text{SE}(\beta_0) = \sqrt{\text{RSE}^2 \left( \frac{1}{n} + \frac{\overline{x}^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \right)}$
• $\text{SE}(\beta_1) = \sqrt{\frac{\text{RSE}^2}{\sum_{i=1}^n(x_i - \overline{x})^2}}$
• $\text{RSE} = \sqrt{\frac{\text{RSS}}{n-2}}$ (residual standard error, see below)
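A sketch of the standard-error formulas, repeating the same made-up toy data so the snippet runs on its own:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Least-squares fit (as in the previous sketch)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))            # residual standard error
sxx = np.sum((x - x.mean()) ** 2)       # sum of squared deviations of x

se_b0 = np.sqrt(rse ** 2 * (1 / n + x.mean() ** 2 / sxx))
se_b1 = np.sqrt(rse ** 2 / sxx)
print(se_b0, se_b1)
```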
• 95% confidence intervals of $\beta_0^t$ and $\beta_1^t$: range of values, such that with 95% probability the true $\beta_0^t$ and $\beta_1^t$ are contained within this range
• $\beta_i \pm 2\, \text{SE}(\beta_i)$
• …that is, the 95% confidence interval for $\beta_i^t$ is $\left[ \beta_i - 2\, \text{SE}(\beta_i), \; \beta_i + 2\, \text{SE}(\beta_i) \right]$
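The intervals follow directly from the standard errors; a self-contained sketch with the same made-up data (the factor 2 approximates the relevant $t$-quantile):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rse = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = np.sqrt(rse ** 2 / np.sum((x - x.mean()) ** 2))

# Approximate 95% confidence interval for the true slope: estimate +/- 2 * SE
print(b1 - 2 * se_b1, b1 + 2 * se_b1)
```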
• Relationship hypothesis test: assess whether $X$ and $Y$ are indeed related ($\beta_1^t \neq 0$)
• Null hypothesis: $H_0$: There is no relationship between $X$ and $Y$ $\rightarrow$ ($\beta_1^t = 0$)
• Alternative hypothesis: $H_a$: There is a relationship between $X$ and $Y$ $\rightarrow$ ($\beta_1^t \neq 0$)
• Try to reject the null hypothesis: is our $\beta_1$ sufficiently far from zero that it is unlikely to have arisen by chance if the true slope were zero? Assessing this depends on the standard error of $\beta_1$
• $t$-statistic:
• $t = \frac{\beta_1}{\text{SE}(\beta_1)}$
• Number of standard errors that $\beta_1$ is away from zero
• $p$-value:
• Assuming that the null hypothesis is true ($\beta_1^t = 0$), the $p$-value is the probability of observing a $t$-statistic at least as extreme as the calculated one
• Can be looked up in a table of the $t$-distribution with $n-2$ degrees of freedom (for large $n$, this is close to the normal distribution)
• $p < 0.05$: assuming that the null hypothesis is true ($\beta_1^t = 0$), the probability of observing a $t$-statistic like ours is $<$ 5% $\rightarrow$ reject null hypothesis
• $p \geq 0.05$: assuming that the null hypothesis is true ($\beta_1^t = 0$), the probability of observing a $t$-statistic like ours is $\geq$ 5% $\rightarrow$ the data are compatible with the null hypothesis $\rightarrow$ cannot reject null hypothesis
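A sketch of the test in Python, using SciPy's Student-$t$ distribution for the two-sided $p$-value (same made-up toy data as above):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rse = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = np.sqrt(rse ** 2 / np.sum((x - x.mean()) ** 2))

t_stat = b1 / se_b1                               # SEs that b1 is away from 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(t_stat, p_value)
```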
• Residual standard error (RSE): assess accuracy (fit) of the model
• $\text{RSE} = \sqrt{\frac{\text{RSS}}{n-2}}$
• Estimate of the standard deviation of $\epsilon$ of the true model $Y = \beta_0^t + \beta_1^t X + \epsilon$
• Roughly, the average amount by which the observed $Y$-values deviate from the regression line
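Since the RSE estimates the standard deviation of $\epsilon$, a quick simulation with a known error standard deviation illustrates this; the true model and all values below are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from a known true model Y = 1 + 2X + eps with sd(eps) = 0.5
n = 1000
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rse = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
print(rse)  # should be close to the true sd(eps) = 0.5
```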
• $R^2$ statistic: alternative to the RSE whose value is always between 0 and 1
• $R^2 = \frac{\text{TSS}-\text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$
• TSS (total sum of squares): $\text{TSS} = \sum_{i=1}^n{(y_i - \overline{y})^2}$
• RSS (residual sum of squares): $\text{RSS} = \sum_{i=1}^n{(y_i - \hat{y_i})^2}$
• Proportion of the variability of the $Y$-values that is explained by the model
• 1: model explains variability perfectly $\rightarrow$ good fit
• 0: model does not explain variability at all $\rightarrow$ bad fit
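A sketch of TSS, RSS and $R^2$ with the same made-up data; in simple linear regression, $R^2$ also equals the squared sample correlation of $X$ and $Y$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
rss = np.sum((y - (b0 + b1 * x)) ** 2)   # residual sum of squares
r2 = 1 - rss / tss
print(r2)
print(np.corrcoef(x, y)[0, 1] ** 2)      # same value: squared correlation
```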