|
|
|
## Theory
|
|
|
|
|
|
|
|
### 1. Introduction
|
|
|
|
|
|
|
|
Classic linear models such as ANOVA and multiple regression are known as general linear models. These models make several assumptions:
|
|
|
|
|
|
|
|
- Response variables are assumed to be normally distributed.
|
|
|
|
|
|
|
|
- Response variables have constant variance over the values of the predictor variables.
|
|
|
|
|
|
|
|
- Response variables equal linear functions of predictor variables.
|
|
|
|
|
|
|
|
In such cases, if the response variables are not normally distributed or do not have equal variance over the predictor variables, appropriate transformations should be applied to the data. Generalized linear models, which go beyond the general linear models, solve these problems by:
|
|
|
|
|
|
|
|
- allowing for non-normally distributed response variables,
|
|
|
|
|
|
|
|
- allowing for heteroscedasticity,
|
|
|
|
|
|
|
|
- allowing for a non-linear relationship between the mean of the response variable and the predictor variables.
|
|
|
|
|
|
|
|
The generalized linear model (called GLM hereafter) is a general framework whose special cases include not only linear regression and ANOVA, but also logistic regression, probit models, Poisson regression, log-linear models, and many more.
|
|
|
|
|
|
|
|
### 2. Three Components of a GLM
|
|
|
|
|
|
|
|
When constructing a generalized linear model, three major decisions must be made:
|
|
|
|
|
|
|
|
1. **Specifying the Random Component:**
|
|
|
|
An appropriate probability distribution must be chosen for the response variable. It should be a member of the *natural exponential family* of distributions. For psychological and behavioral research, the most important distributions are:
|
|
|
|
|
|
|
|
- Continuous variables:
|
|
|
|
  * Normal: <img src="https://latex.codecogs.com/svg.latex?f(y;&space;\mu,&space;\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}exp(\frac{-(y-\mu)^2}{2\sigma^2})" title="f(y; \mu, \sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}exp(\frac{-(y-\mu)^2}{2\sigma^2})" /> for <img src="https://latex.codecogs.com/svg.latex?-\infty&space;<&space;y&space;<&space;\infty" title="-\infty < y < \infty" />
|
|
|
|
|
|
|
|
  * Gamma: <img src="https://latex.codecogs.com/svg.latex?f(y;&space;\mu,\phi)=\frac{1}{\Gamma(\phi^{-1})}(\frac{1}{\mu\phi})^{1/\phi}y^{1/\phi-1}exp(\frac{-y}{\mu\phi})" title="f(y; \mu,\phi)=\frac{1}{\Gamma(\phi^{-1})}(\frac{1}{\mu\phi})^{1/\phi}y^{1/\phi-1}exp(\frac{-y}{\mu\phi})" /> for <img src="https://latex.codecogs.com/svg.latex?y&space;>&space;0" title="y > 0" />
|
|
|
|
|
|
|
|
  * Inverse Gaussian: <img src="https://latex.codecogs.com/svg.latex?f(y;&space;\mu,&space;\phi)=\frac{1}{\sqrt{2\pi&space;y^3&space;\phi}}exp(-\frac{(y-\mu)^2}{2\mu^2\phi&space;y})" title="f(y; \mu, \phi)=\frac{1}{\sqrt{2\pi y^3 \phi}}exp(-\frac{(y-\mu)^2}{2\mu^2\phi y})" /> for <img src="https://latex.codecogs.com/svg.latex?y&space;>&space;0" title="y > 0" />
|
|
|
|
|
|
|
|
- Discrete variables:
|
|
|
|
  * Poisson: <img src="https://latex.codecogs.com/svg.latex?P(y;&space;\mu)&space;=&space;\frac{e^{-\mu}\mu^y}{y!}" title="P(y; \mu) = \frac{e^{-\mu}\mu^y}{y!}" /> for <img src="https://latex.codecogs.com/svg.latex?y=0,1,..." title="y=0,1,..." />
|
|
|
|
|
|
|
|
  * Bernoulli: <img src="https://latex.codecogs.com/svg.latex?P(y;&space;\pi)=\pi^y(1-\pi)^{1-y}" title="P(y; \pi)=\pi^y(1-\pi)^{1-y}" /> for <img src="https://latex.codecogs.com/svg.latex?y=0,1" title="y=0,1" />
|
|
|
|
|
|
|
|
  * Binomial: <img src="https://latex.codecogs.com/svg.latex?P(y;&space;\pi,n)=\binom{n}{y}\pi^y&space;(1-\pi)^{n-y}" title="P(y; \pi,n)=\binom{n}{y}\pi^y (1-\pi)^{n-y}" /> for <img src="https://latex.codecogs.com/svg.latex?y=0,1,&space;...,&space;n" title="y=0,1, ..., n" />
|
|
|
|
|
|
|
|
We should note that this distribution is not the *true* distribution of the population, but an approximation of the distribution of the response variable. Each natural exponential family distribution can be parameterized by a *natural parameter* <img src="https://latex.codecogs.com/svg.latex?\theta" title="\theta" /> and a *dispersion parameter* <img src="https://latex.codecogs.com/svg.latex?\phi" title="\phi" />. The parameter <img src="https://latex.codecogs.com/svg.latex?\theta" title="\theta" /> carries information about the location of the distribution and, in its basic form, is a function of the distribution mean <img src="https://latex.codecogs.com/svg.latex?\mu" title="\mu" />. On the other hand, the variance <img src="https://latex.codecogs.com/svg.latex?\sigma^2" title="\sigma^2" /> of the distribution is a function of <img src="https://latex.codecogs.com/svg.latex?\mu" title="\mu" /> and <img src="https://latex.codecogs.com/svg.latex?\phi" title="\phi" />. Some important points should be noted for these distributions:
|
|
|
|
|
|
|
|
- The gamma distribution is suitable for non-negative, positively skewed response variables such as reaction times.
|
|
|
|
|
|
|
|
- When responses are skewed, using a gamma distribution not only implies heteroscedasticity, but also implies a specific relationship between the mean and the variance.
|
|
|
|
|
|
|
|
- For a Poisson distribution, the mean equals the variance. When the response variable is a count measure, if its distribution looks symmetric we may be tempted to choose a normal distribution for the response. However, using a normal distribution for a response variable that fundamentally follows a Poisson distribution violates the equal-variance assumption, because heteroscedasticity is expected for counts.
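To illustrate the mean-variance equality of the Poisson distribution, here is a small pure-Python sketch (the sampler and sample values are illustrative, not from the text) that draws Poisson counts via Knuth's multiplication algorithm and compares the sample mean and variance:

```python
import math
import random
import statistics

def poisson_sample(mu, rng):
    """Draw one Poisson(mu) count using Knuth's multiplication algorithm."""
    limit = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)  # fixed seed for reproducibility
draws = [poisson_sample(4.0, rng) for _ in range(20000)]
mean = statistics.fmean(draws)
var = statistics.pvariance(draws)
# For counts the spread grows with the mean: both statistics sit near 4
```

The simulated mean and variance both land close to 4, which is exactly the heteroscedasticity a normal model would ignore.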
|
|
|
|
|
|
|
|
2. **Specifying the Systematic Component (Linear Predictor):**
|
|
|
|
A linear combination of the predictor variables should be constructed as the linear predictor of the model. This component is the fixed structural part of the model that is used to explain systematic variability between means. If we have Q predictors in our model (plus an intercept), the linear predictor is:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?\eta=\beta_0+\beta_1x_1+\beta_2x_2+&space;...+&space;\beta_Qx_Q" title="\eta=\beta_0+\beta_1x_1+\beta_2x_2+ ...+ \beta_Qx_Q" />
|
|
|
|
|
|
|
|
In this linear predictor, we can also use transformations of the predictors as well as their interactions.
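As a small illustration (with made-up coefficients and predictor values), a linear predictor containing two predictors and their interaction can be computed as:

```python
# Hypothetical coefficients: intercept, x1, x2, and the x1*x2 interaction
beta = [1.0, 0.3, -0.2, 0.1]
x1, x2 = 2.0, 0.5

# eta = beta0 + beta1*x1 + beta2*x2 + beta3*(x1*x2)
eta = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * (x1 * x2)
# eta = 1.0 + 0.6 - 0.1 + 0.1 = 1.6
```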
|
|
|
|
|
|
|
|
3. **Choosing a Link Function:**
|
|
|
|
An appropriate function must be chosen that maps the mean of the response variable onto the linear predictor. It should be chosen so that the predicted means stay in the permissible range.
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?g(\mu)=\eta" title="g(\mu)=\eta" />
|
|
|
|
|
|
|
|
This link function should be monotonic and differentiable. As a result, the mean of the response variable is linked to the predictor variables as:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?\mu&space;=&space;g^{-1}(\eta)=g^{-1}(\beta_0+\beta_1x_1+...+\beta_Qx_Q)" title="\mu = g^{-1}(\eta)=g^{-1}(\beta_0+\beta_1x_1+...+\beta_Qx_Q)" />
|
|
|
|
|
|
|
|
The important point here is that the link relates the *mean* of the response to the linear predictor; this is different from transforming the response variable itself.
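For instance, with a log link (commonly paired with Poisson responses), the inverse link g⁻¹(η) = exp(η) guarantees a positive predicted mean. A quick sketch with hypothetical coefficients:

```python
import math

beta = [0.5, 0.2]   # hypothetical intercept and slope
x = 3.0

eta = beta[0] + beta[1] * x   # linear predictor: g(mu) = eta with g = ln
mu = math.exp(eta)            # inverse link: mu = exp(eta) is always > 0
```

Note that the model links the mean through exp(η); it does not log-transform the observed responses.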
|
|
|
|
|
|
|
|
An important consideration in choosing a link function is whether the selected link will yield permissible predicted values. For example, a common link for non-negative count data or reaction times is the natural logarithm. When the data are not bounded, an identity function or a reciprocal function can be used. For data bounded between 0 and 1, a link based on a cumulative distribution function can be chosen (such as the logistic (logit function), normal (probit function), or Gumbel (log-log function) distributions).
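A quick way to see why a CDF-based link keeps predictions permissible: the inverse of the logit link is the logistic CDF, which maps any real-valued linear predictor into (0, 1). A minimal sketch:

```python
import math

def inv_logit(eta):
    """Logistic CDF: maps any real eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Even extreme linear predictors stay strictly inside the unit interval
probs = [inv_logit(eta) for eta in (-20.0, -1.0, 0.0, 1.0, 20.0)]
```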
|
|
|
|
|
|
|
|
For the distributions of the natural exponential family, special link functions exist which are called *canonical link functions*. Using the canonical link ensures that the natural parameter of the distribution <img src="https://latex.codecogs.com/svg.latex?\theta" title="\theta" /> equals the linear predictor <img src="https://latex.codecogs.com/svg.latex?\eta" title="\eta" />. These functions are a good initial choice for the modeling procedure, but in some cases they may not be the best option.
|
|
|
|
|
|
|
|
In summary, a generalized linear model can be expressed as:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?y_i&space;\sim&space;f(y|\mu_i,\phi)\\&space;\indent&space;g(\mu_i)=\eta_i\\&space;\indent&space;\eta_i=\sum_q{\beta_qx_{iq}}=\boldsymbol{\beta^{'}x}_i" title="y_i \sim f(y|\mu_i,\phi)\\ \indent g(\mu_i)=\eta_i\\ \indent \eta_i=\sum_q{\beta_qx_{iq}}=\boldsymbol{\beta^{'}x}_i" />
|
|
|
|
|
|
|
|
If a GLM does not fit the data well, the distribution chosen for the response variable may not be appropriate, the link function may not be appropriate, the linear predictor may not contain all relevant variables, or a combination of these problems may be present.
|
|
|
|
|
|
|
|
### 3. GLM Parameter Estimation
|
|
|
|
|
|
|
|
The typical approach to estimating GLM parameters is maximum likelihood estimation (MLE). Given the data (the response variable) and the distribution assumed for them (the random component of the model), a likelihood function can be constructed; the parameters that maximize it are the solution of the estimation problem. The likelihood function has the same form as the distribution function, but is viewed as a function of the parameters rather than of the data. For example, for a Poisson distribution, the likelihood function is:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?L(\mu;&space;y)=\frac{e^{-\mu}\mu^y}{y!}" title="L(\mu; y)=\frac{e^{-\mu}\mu^y}{y!}" />
|
|
|
|
|
|
|
|
which is the same as the probability mass function of the Poisson distribution, but considered as a function of <img src="https://latex.codecogs.com/svg.latex?\mu" title="\mu" /> with <img src="https://latex.codecogs.com/svg.latex?y" title="y" /> fixed. Suppose we have N independent observations in a dataset. The full likelihood function is then:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?L(\mu|y_1,&space;y_2,&space;...,&space;y_N)=\prod_{i=1}^N{\frac{e^{-\mu}\mu^{y_i}}{y_i!}}" title="L(\mu|y_1, y_2, ..., y_N)=\prod_{i=1}^N{\frac{e^{-\mu}\mu^{y_i}}{y_i!}}" />
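For an intercept-only Poisson model, the likelihood above is maximized at the sample mean. A sketch with made-up counts, maximizing the log-likelihood over a grid:

```python
import math

y = [2, 4, 3, 5, 1]   # hypothetical counts; the sample mean is 3.0

def loglik(mu):
    # log of the product likelihood: sum over i of y_i*ln(mu) - mu - ln(y_i!)
    return sum(yi * math.log(mu) - mu - math.lgamma(yi + 1) for yi in y)

grid = [m / 100 for m in range(50, 801)]   # candidate mu values 0.50 .. 8.00
mu_hat = max(grid, key=loglik)             # grid search for the maximizer
# mu_hat lands on the sample mean, 3.0
```

Working in log-likelihood avoids underflow from multiplying many small probabilities, and the maximizer is unchanged because the log is monotonic.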
|
|
|
|
|
|
|
|
In a GLM, we should note that the mean <img src="https://latex.codecogs.com/svg.latex?\mu" title="\mu" /> is conditioned on the predictor variables via the link function:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?\mu_i&space;=&space;g^{-1}(\boldsymbol{\beta^{'}x}_i)" title="\mu_i = g^{-1}(\boldsymbol{\beta^{'}x}_i)" />
|
|
|
|
|
|
|
|
As a result, the likelihood function becomes a function of the regression coefficients <img src="https://latex.codecogs.com/svg.latex?\boldsymbol{\beta}" title="\boldsymbol{\beta}" />. To solve the optimization problem of finding the best <img src="https://latex.codecogs.com/svg.latex?\boldsymbol{\beta}" title="\boldsymbol{\beta}" />, an iterative algorithm such as Newton-Raphson or Fisher scoring is usually used.
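A minimal sketch of Fisher scoring for an intercept-only Poisson model with a log link (data and starting value are made up): the score is Σ(yᵢ − μ) and the expected information is Nμ, so the updates converge to β₀ = ln(ȳ):

```python
import math

y = [2, 4, 3, 5, 1]   # hypothetical counts
beta0 = 0.0           # starting value for the intercept

for _ in range(25):
    mu = math.exp(beta0)                 # inverse log link
    score = sum(yi - mu for yi in y)     # gradient of the log-likelihood
    info = len(y) * mu                   # Fisher (expected) information
    beta0 += score / info                # Fisher scoring update
# beta0 converges to ln(mean(y)) = ln(3.0)
```

With predictors, the same recipe generalizes to a weighted least-squares update on the linear predictor at each iteration.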
|
|
|
|
|
|
|
|
**Important:** During the optimization procedure, we may confront problems such as lack of convergence, fitted values outside the permitted range, and a singular or nearly singular Hessian matrix. These problems can generally be solved by modifying the model. For example, out-of-range estimates can be addressed by modifying the link function. Likewise, if the model has too many predictors or the predictors are highly correlated, the Hessian matrix will be singular or nearly singular; a singular Hessian can be detected by very large standard errors, and the remedy is to modify the linear predictor of the model.
|
|
|
|
|
|
|
|
### 4. Statistical Inference for Model Parameters
|
|
|
|
|
|
|
|
Statistical inference for the parameters includes hypothesis testing and the formation of confidence intervals. Here we discuss only hypothesis testing, using Wald, F, and likelihood ratio tests.
|
|
|
|
|
|
|
|
#### 4.1 Hypothesis Testing
|
|
|
|
|
|
|
|
When the dispersion parameter is known and need not be estimated, we can perform a Wald test, and the Wald statistic is compared to a chi-squared distribution. If the dispersion parameter must be estimated, an F statistic should be computed and compared to the F distribution. These two tests can be applied to a single fitted model. Likelihood ratio tests are more powerful than Wald or F tests, but they require estimating two models.
|
|
|
|
|
|
|
|
**Wald Statistics**
|
|
|
|
|
|
|
|
It can be shown that when using MLE for parameter estimation, the sampling distribution of the parameter estimates is asymptotically multivariate normal (MVN) for large sample sizes:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?\boldsymbol{\hat{\beta}}&space;\sim&space;MVN(\boldsymbol{\beta,&space;\Sigma}_{\hat{\beta}})" title="\boldsymbol{\hat{\beta}} \sim MVN(\boldsymbol{\beta, \Sigma}_{\hat{\beta}})" />
|
|
|
|
|
|
|
|
Using this fact, the <img src="https://latex.codecogs.com/svg.latex?q^{th}" title="q^{th}" /> parameter estimate <img src="https://latex.codecogs.com/svg.latex?\hat{\beta}_q" title="\hat{\beta}_q" /> is normally distributed, <img src="https://latex.codecogs.com/svg.latex?\hat{\beta}_q&space;\sim&space;N(\beta_q,&space;\sigma_{\beta_q}^2)" title="\hat{\beta}_q \sim N(\beta_q, \sigma_{\beta_q}^2)" />. As a result, the Wald statistic for testing the null hypothesis <img src="https://latex.codecogs.com/svg.latex?\beta_q=\beta_q^*" title="\beta_q=\beta_q^*" /> can be calculated as:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?Wald&space;=&space;z^2&space;=&space;(\frac{\hat{\beta_q}-\beta_q^*}{ASE_q})^2" title="Wald = z^2 = (\frac{\hat{\beta_q}-\beta_q^*}{ASE_q})^2" />
|
|
|
|
|
|
|
|
where <img src="https://latex.codecogs.com/svg.latex?ASE_q" title="ASE_q" /> is the asymptotic standard error (calculated during the parameter estimation procedure). The Wald statistic is then compared to a chi-squared distribution with dof = 1.
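Numerically, the single-parameter Wald statistic is just a squared z score. A sketch with a hypothetical estimate, null value, and ASE:

```python
beta_hat = 0.8    # hypothetical estimated coefficient
beta_star = 0.0   # null-hypothesis value
ase = 0.25        # hypothetical asymptotic standard error

z = (beta_hat - beta_star) / ase
wald = z ** 2     # compare to a chi-squared distribution with dof = 1
# z = 3.2, so wald = 10.24, well past the 5% critical value of 3.84
```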
|
|
|
|
|
|
|
|
We can also define a more general form of the Wald statistic to test multiple hypotheses simultaneously by introducing contrasts. In this case the hypothesis for <img src="https://latex.codecogs.com/svg.latex?Q^*" title="Q^*" /> simultaneous tests is:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?H_0:&space;\boldsymbol{C}(\boldsymbol{\beta-\beta^*})=0" title="H_0: \boldsymbol{C}(\boldsymbol{\beta-\beta^*})=0" />
|
|
|
|
|
|
|
|
where <img src="https://latex.codecogs.com/svg.latex?\boldsymbol{C}" title="\boldsymbol{C}" /> is a <img src="https://latex.codecogs.com/svg.latex?Q^*\times&space;Q" title="Q^*\times Q" /> matrix of constants called contrast matrix. The Wald statistics then can be calculated as:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?Wald&space;=&space;\boldsymbol{(\hat{\beta}-\beta^*)^{'}C^{'}(C\hat{\Sigma}_{\hat{\beta}}C^{'})^{-1}C(\hat{\beta}-\beta^*)}&space;\sim&space;\chi_{Q^*}^2" title="Wald = \boldsymbol{(\hat{\beta}-\beta^*)^{'}C^{'}(C\hat{\Sigma}_{\hat{\beta}}C^{'})^{-1}C(\hat{\beta}-\beta^*)} \sim \chi_{Q^*}^2" />
|
|
|
|
|
|
|
|
which can be compared with chi-squared distribution with dof=<img src="https://latex.codecogs.com/svg.latex?Q^*" title="Q^*" />.
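As a small worked case of the contrast form (all numbers hypothetical): testing β₁ = β₂ = 0 with C the 2×2 identity and a diagonal estimated covariance, the quadratic form collapses to a sum of squared z scores:

```python
beta_hat = [0.8, -0.5]   # hypothetical estimates; beta* = 0 for both
var = [0.04, 0.09]       # diagonal of the estimated covariance of beta_hat

# With C = identity and a diagonal covariance, the quadratic form
# reduces to sum over q of beta_hat_q^2 / var_q
wald = sum(b * b / v for b, v in zip(beta_hat, var))
# compare to a chi-squared distribution with dof = Q* = 2
```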
|
|
|
|
|
|
|
|
**F-Tests**
|
|
|
|
|
|
|
|
For models where <img src="https://latex.codecogs.com/svg.latex?\phi" title="\phi" /> is estimated, there is extra variability due to the estimation of <img src="https://latex.codecogs.com/svg.latex?\phi" title="\phi" /> that must be taken into account. For a single-parameter test, the statistic is the square root of the Wald statistic (the z score); however, its sampling distribution is Student's t-distribution with <img src="https://latex.codecogs.com/svg.latex?dof&space;=&space;N-Q" title="dof = N-Q" />. Alternatively, we can square the test statistic (the Wald statistic) and compare the result to an F-distribution with <img src="https://latex.codecogs.com/svg.latex?\nu_1=1" title="\nu_1=1" /> and <img src="https://latex.codecogs.com/svg.latex?\nu_2=N-Q" title="\nu_2=N-Q" />. For the general form with contrasts, the F statistic is:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?F&space;=&space;\frac{Wald}{Q^*}" title="F = \frac{Wald}{Q^*}" />
|
|
|
|
|
|
|
|
Then this statistic will be compared to an F distribution with <img src="https://latex.codecogs.com/svg.latex?\nu_1&space;=&space;Q^*" title="\nu_1 = Q^*" /> and <img src="https://latex.codecogs.com/svg.latex?\nu_2&space;=&space;N-Q" title="\nu_2 = N-Q" />.
|
|
|
|
|
|
|
|
**Likelihood Ratio Tests**
|
|
|
|
|
|
|
|
Suppose that we wish to test the hypothesis that <img src="https://latex.codecogs.com/svg.latex?Q^*" title="Q^*" /> regression coefficients all equal zero. We can fit a model including all the coefficients (the full model) and a model excluding the <img src="https://latex.codecogs.com/svg.latex?Q^*" title="Q^*" /> coefficients (the nested, or simpler, model). Then the LR statistic is calculated as:
|
|
|
|
|
|
|
|
<img src="https://latex.codecogs.com/svg.latex?LR&space;=&space;-2[ln(L(M_{nested}))-ln(L(M_{full}))]" title="LR = -2[ln(L(M_{nested}))-ln(L(M_{full}))]" />
|
|
|
|
|
|
|
|
This statistic can then be compared with a chi-squared distribution with <img src="https://latex.codecogs.com/svg.latex?dof&space;=&space;Q^*" title="dof = Q^*" />.
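Computing the LR statistic is straightforward once both models are fitted; a sketch with hypothetical maximized log-likelihoods:

```python
ll_full = -102.3     # hypothetical maximized log-likelihood, full model
ll_nested = -106.8   # hypothetical maximized log-likelihood, nested model

# LR = -2 * [ln L(nested) - ln L(full)]; the nested model can never fit
# better, so the statistic is non-negative
LR = -2 * (ll_nested - ll_full)
# LR = 9.0, compared to a chi-squared distribution with dof = Q*
```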
|
|