Chapter 2: Introduction to Generalized Linear Models

Author

Steffen Foerster

2.1 Beyond Linear Regression: The Need for Generalized Linear Models

In introductory statistics, we learned about linear regression - a powerful tool for modeling continuous response variables. Linear regression assumes:

The response variable is continuous and can take any value on the real number line
The relationship between predictors and response is linear
Errors are normally distributed
Variance is constant across all levels of predictors (homoscedasticity)

But ecological data often violate these assumptions:

Data types that challenge linear regression

Binary outcomes: presence/absence, alive/dead, success/failure
Count data: number of individuals, number of offspring, number of species
Proportions: percent cover, survival rates, germination rates
Rates: events per unit time or area

None of these fit the assumptions of normal linear regression!

Generalized Linear Models (GLMs) extend linear regression to handle these different data types while maintaining a unified framework.

2.2 The Three Components of a GLM

Every GLM consists of three key components:

2.2.1 The Random Component: Distribution of Y

This specifies the probability distribution of the response variable. Different distributions are appropriate for different data types:

library(tidyverse)
library(gridExtra)

# Set global theme with larger text
theme_set(theme_minimal(base_size = 16))

Code

# Create example distributions
set.seed(123)
x_vals <- 0:20

# Normal distribution (linear regression)
normal_data <- data.frame(
  x = seq(-10, 10, length.out = 100),
  density = dnorm(seq(-10, 10, length.out = 100), 0, 2),
  Distribution = "Normal\n(Continuous data)"
)

# Binomial distribution (binary/proportion data)
binomial_data <- data.frame(
  x = 0:10,
  density = dbinom(0:10, size = 10, prob = 0.3),
  Distribution = "Binomial\n(Binary/Count data)"
)

# Poisson distribution (count data)
poisson_data <- data.frame(
  x = x_vals,
  density = dpois(x_vals, lambda = 5),
  Distribution = "Poisson\n(Count data)"
)

# Gamma distribution (continuous positive data)
gamma_data <- data.frame(
  x = seq(0.1, 20, length.out = 100),
  density = dgamma(seq(0.1, 20, length.out = 100), shape = 2, rate = 0.5),
  Distribution = "Gamma\n(Positive continuous)"
)

# Plot
p1 <- ggplot(normal_data, aes(x = x, y = density)) +
  geom_line(color = "blue", linewidth = 1.2) +
  labs(title = "Normal Distribution", x = "Value", y = "Density")

p2 <- ggplot(binomial_data, aes(x = x, y = density)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "Binomial Distribution", x = "Number of Successes", y = "Probability")

p3 <- ggplot(poisson_data, aes(x = x, y = density)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Poisson Distribution", x = "Count", y = "Probability")

p4 <- ggplot(gamma_data, aes(x = x, y = density)) +
  geom_line(color = "purple", linewidth = 1.2) +
  labs(title = "Gamma Distribution", x = "Value", y = "Density")

grid.arrange(p1, p2, p3, p4, ncol = 2)

Choosing the right distribution

The distribution should match your data type:

Normal: Continuous data with no bounds (temperature, height, weight)
Binomial: Binary outcomes or proportions with known denominators
Poisson: Counts of events (number of individuals, number of occurrences)
Gamma: Positive continuous data (response times, waiting times)
Negative Binomial: Overdispersed count data (coming in Chapter 5)
Beta: Proportions bounded between 0 and 1 (coming in Chapter 7)

2.2.2 The Systematic Component: Linear Predictor

This is the familiar linear combination of predictors:

\[\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p\]

Where: - \(\eta\) (eta) is called the linear predictor - \(\beta_0, \beta_1, \ldots, \beta_p\) are coefficients we estimate - \(X_1, X_2, \ldots, X_p\) are our predictor variables

This part is exactly like linear regression! We’re still building a linear combination of our predictors.

2.2.3 The Link Function: Connecting Linear Predictor to Response

The link function \(g()\) connects the linear predictor to the expected value of the response:

\[g(E(Y)) = \eta = \beta_0 + \beta_1 X_1 + \ldots\]

Or equivalently:

\[E(Y) = g^{-1}(\eta)\]

Different link functions are used with different distributions:

Code

# Create data for visualization
linear_pred <- seq(-5, 5, length.out = 100)

link_data <- data.frame(
  linear_pred = linear_pred,
  identity = linear_pred,  # For normal (no transformation)
  logit = plogis(linear_pred),  # For binomial
  log = exp(linear_pred),  # For Poisson
  inverse = 1/linear_pred  # For gamma (only positive values)
)

# Plot identity link
p1 <- ggplot(link_data, aes(x = linear_pred, y = identity)) +
  geom_line(color = "blue", linewidth = 1.2) +
  labs(title = "Identity Link\n(Normal distribution)",
       x = "Linear Predictor (η)",
       y = "E(Y)")

# Plot logit link
p2 <- ggplot(link_data, aes(x = linear_pred, y = logit)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "red") +
  labs(title = "Logit Link\n(Binomial distribution)",
       x = "Linear Predictor (η)",
       y = "E(Y) = Probability") +
  ylim(-0.1, 1.1)

# Plot log link
p3 <- ggplot(link_data, aes(x = linear_pred, y = log)) +
  geom_line(color = "red", linewidth = 1.2) +
  coord_cartesian(ylim = c(0, 20)) +
  labs(title = "Log Link\n(Poisson distribution)",
       x = "Linear Predictor (η)",
       y = "E(Y) = Count")

grid.arrange(p1, p2, p3, ncol = 3)

Why do we need link functions?

Link functions serve two crucial purposes:

Constrain predictions to valid ranges:
- Logit link keeps probabilities between 0 and 1
- Log link ensures counts are positive
- This prevents impossible predictions!
Linearize non-linear relationships:
- Many biological relationships are non-linear on the original scale
- But linear on the transformed scale
- Makes estimation and interpretation more tractable

2.3 The GLM Framework in R

In R, we fit GLMs using the glm() function, which has a very similar syntax to lm():

# General syntax
model <- glm(response ~ predictor1 + predictor2 + ...,
            family = distribution_family(link = "link_function"),
            data = your_data)

The key difference from lm() is the family argument, which specifies both the distribution and link function.

2.3.1 Common Family Specifications

# For binary data (presence/absence)
glm(presence ~ temperature, family = binomial(link = "logit"), data = df)

# For count data
glm(count ~ habitat, family = poisson(link = "log"), data = df)

# For continuous data (same as lm)
glm(weight ~ age, family = gaussian(link = "identity"), data = df)

Default link functions

Each distribution family has a default (“canonical”) link function:

family = binomial → logit link (most common)
family = poisson → log link
family = gaussian → identity link
family = Gamma → inverse link

You can usually omit the link specification and use the default:

glm(presence ~ temperature, family = binomial, data = df)  # Assumes logit link

2.4 A Concrete Example: Seedling Growth Data

Let’s work through a complete example using real data to understand how GLMs work in practice. We’ll use a simple dataset to illustrate that linear models are just a special case of GLMs.

2.4.1 The Dataset

We’ll analyze data from a tree seedling growth experiment. Researchers measured the height of Douglas fir seedlings at different ages to understand growth patterns. This is a classic continuous response variable that should follow a normal distribution - perfect for demonstrating the GLM framework.

Dataset: seedling_growth.csv

Variables: - age_weeks: Age of seedling in weeks (predictor) - height_cm: Height of seedling in centimeters (response) - light_level: Amount of light exposure (low, medium, high)

Study context: Seedlings were grown in a greenhouse under controlled conditions with varying light levels. Measurements were taken weekly for 10 weeks.

2.4.2 Loading and Exploring the Data

Always start with data exploration! This helps you understand your data structure, identify potential issues, and inform your modeling decisions.

# Load the data
seedling_data <- read.csv("data/seedling_growth.csv")

# First look at the structure
str(seedling_data)

'data.frame':   100 obs. of  3 variables:
 $ age_weeks  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ height_cm  : num  2.05 3.32 5.75 2.56 3.6 ...
 $ light_level: chr  "low" "medium" "high" "low" ...

# Summary statistics
summary(seedling_data)

   age_weeks      height_cm     light_level       
 Min.   : 1.0   Min.   :2.052   Length:100        
 1st Qu.: 3.0   1st Qu.:4.440   Class :character  
 Median : 5.5   Median :5.919   Mode  :character  
 Mean   : 5.5   Mean   :5.812                     
 3rd Qu.: 8.0   3rd Qu.:7.148                     
 Max.   :10.0   Max.   :9.419

# Check for missing values
colSums(is.na(seedling_data))

  age_weeks   height_cm light_level 
          0           0           0

# Sample size
nrow(seedling_data)

[1] 100

Initial observations

From our data exploration:

We have 100 observations
Age ranges from 1 to 10 weeks
Height ranges from 2.1 to 9.4 cm
Mean height is 5.8 cm (SD = 1.8)
No missing values - data is complete
Light level is a factor with three levels

This appears to be clean, well-structured data suitable for analysis.

2.4.3 Visual Data Exploration

Let’s visualize the relationships in our data:

# Plot 1: Histogram of height
p1 <- ggplot(seedling_data, aes(x = height_cm)) +
  geom_histogram(bins = 20, fill = "forestgreen", color = "black") +
  labs(title = "Distribution of Seedling Height",
       x = "Height (cm)",
       y = "Frequency")

# Plot 2: Height vs Age
p2 <- ggplot(seedling_data, aes(x = age_weeks, y = height_cm)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Seedling Height vs. Age",
       x = "Age (weeks)",
       y = "Height (cm)")

# Plot 3: Height by light level
p3 <- ggplot(seedling_data, aes(x = light_level, y = height_cm, fill = light_level)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(title = "Height by Light Level",
       x = "Light Level",
       y = "Height (cm)") +
  theme(legend.position = "none")

# Plot 4: Height vs Age by light level
p4 <- ggplot(seedling_data, aes(x = age_weeks, y = height_cm, color = light_level)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_viridis_d() +
  labs(title = "Growth Trajectories by Light Level",
       x = "Age (weeks)",
       y = "Height (cm)",
       color = "Light Level")

grid.arrange(p1, p2, p3, p4, ncol = 2)

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Exploratory visualizations of seedling growth data

Key patterns from visual exploration

Distribution: Height appears roughly normally distributed (good for gaussian GLM!)
Age relationship: Clear positive linear relationship between age and height
- Seedlings grow steadily over time
- Relationship appears linear (no obvious curvature)
Light effect: Seedlings under high light tend to be taller
- Suggests light level is an important predictor
- Some overlap between groups
Interaction?: Growth trajectories (slopes) look similar across light levels
- Suggests light level affects starting height more than growth rate
- An additive model may be appropriate

These insights guide our modeling approach!

2.5 Fitting a GLM to the Seedling Data

Now let’s fit models to understand seedling growth. We’ll start with a simple model and build up.

2.5.1 Simple Linear Model as a GLM

# Fit using lm() - the traditional approach
lm_model <- lm(height_cm ~ age_weeks, data = seedling_data)

# Fit using glm() with gaussian family
glm_model <- glm(height_cm ~ age_weeks, family = gaussian, data = seedling_data)

# Compare results
summary(lm_model)$coefficients

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 2.9473846 0.22828513 12.91098 7.224042e-23
age_weeks   0.5208982 0.03679148 14.15812 1.953232e-25

summary(glm_model)$coefficients

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 2.9473846 0.22828513 12.91098 7.224042e-23
age_weeks   0.5208982 0.03679148 14.15812 1.953232e-25

They’re identical!

Notice that lm() and glm() with family = gaussian give identical results: - Same coefficient estimates - Same standard errors - Same t-values and p-values

This demonstrates that linear regression is just a special case of the GLM framework: - Uses the normal (gaussian) distribution - With the identity link function (no transformation) - Estimated using maximum likelihood (equivalent to least squares for normal data)

So lm() is really just a convenient shortcut for glm(y ~ x, family = gaussian)!

2.5.2 Interpreting the Simple Model

# Extract coefficients
intercept <- coef(glm_model)[1]
slope <- coef(glm_model)[2]

# Display values
intercept

(Intercept) 
   2.947385

slope

age_weeks 
0.5208982

# Visualize the fit
ggplot(seedling_data, aes(x = age_weeks, y = height_cm)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Seedling Growth Model",
       subtitle = paste0("Height = ", round(intercept, 1), " + ", 
                        round(slope, 2), " × Age"),
       x = "Age (weeks)",
       y = "Height (cm)")

`geom_smooth()` using formula = 'y ~ x'

Simple linear relationship between age and height

2.5.3 Adding Light Level to the Model

# Fit a model with both age and light level
full_glm <- glm(height_cm ~ age_weeks + light_level, 
               family = gaussian, 
               data = seedling_data)

summary(full_glm)


Call:
glm(formula = height_cm ~ age_weeks + light_level, family = gaussian, 
    data = seedling_data)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.93829    0.19078  20.643  < 2e-16 ***
age_weeks          0.51897    0.02555  20.310  < 2e-16 ***
light_levellow    -1.85147    0.17933 -10.324  < 2e-16 ***
light_levelmedium -1.06295    0.18068  -5.883 5.87e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.5385502)

    Null deviance: 333.291  on 99  degrees of freedom
Residual deviance:  51.701  on 96  degrees of freedom
AIC: 227.82

Number of Fisher Scoring iterations: 2

Interpreting the fuller model

With both predictors included:

Age effect (slope): 0.519 cm/week - Seedlings grow about 0.52 cm per week - This is the growth rate, holding light level constant

Light level effects (compared to baseline - low light): - Medium light: -1.06 cm taller on average - High light: NA cm taller on average

All predictors are statistically significant (p < 0.001), confirming that both age and light affect seedling height.

2.6 Maximum Likelihood Estimation

GLMs are fit using maximum likelihood estimation (MLE) rather than least squares.

2.6.1 What is Maximum Likelihood?

The idea is simple: we choose parameter values that make our observed data most likely to have occurred.

Code

# Simple example: estimating the probability of success
# Suppose we observe 7 successes out of 10 trials

# Try different probability values
prob_values <- seq(0.01, 0.99, by = 0.01)

# Calculate likelihood for each probability
# Likelihood = P(data | parameter)
likelihood <- dbinom(7, size = 10, prob = prob_values)

likelihood_df <- data.frame(
  probability = prob_values,
  likelihood = likelihood
)

# Find maximum
mle_estimate <- prob_values[which.max(likelihood)]

# Plot
ggplot(likelihood_df, aes(x = probability, y = likelihood)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_vline(xintercept = mle_estimate, linetype = "dashed", color = "red", linewidth = 1) +
  geom_point(x = mle_estimate, y = max(likelihood), color = "red", size = 3) +
  annotate("text", x = mle_estimate + 0.15, y = max(likelihood), 
           label = paste("MLE =", mle_estimate), color = "red", size = 5) +
  labs(
    title = "Maximum Likelihood Estimation",
    subtitle = "Finding the parameter value that maximizes the likelihood of observing our data",
    x = "Probability Parameter",
    y = "Likelihood"
  )

MLE vs. Least Squares

Least Squares (used in lm()): - Minimizes the sum of squared residuals - Works well for normally distributed data - Closed-form solution (direct calculation)

Maximum Likelihood (used in glm()): - Maximizes the probability of observing the data - Works for any distribution - Usually requires iterative numerical optimization - More general and flexible

Key insight: For normal data with identity link, MLE and least squares give the same answer! This is why our lm() and glm() results were identical.

2.7 Model Fit and Deviance

Instead of R-squared and residual sums of squares, GLMs use deviance to measure model fit.

Deviance is analogous to the residual sum of squares, but for GLMs:

\[D = -2 \times [\log(\text{likelihood of fitted model}) - \log(\text{likelihood of saturated model})]\]

# Extract deviance information from our model
full_glm$null.deviance

[1] 333.2909

full_glm$deviance

[1] 51.70082

# Calculate reduction in deviance
full_glm$null.deviance - full_glm$deviance

[1] 281.5901

# Calculate pseudo R-squared
pseudo_r2 <- 1 - (full_glm$deviance / full_glm$null.deviance)
pseudo_r2

[1] 0.8448778

Understanding deviance

Lower deviance = better fit
Null deviance: Deviance of a model with only an intercept
Residual deviance: Deviance of your fitted model
Reduction in deviance: How much better your model is than the null

The reduction in deviance follows a chi-squared distribution, which we use for hypothesis tests!

For our seedling model: - We reduced deviance by 281.6 - This represents a 84.5% reduction - Our predictors substantially improve the model

2.8 Why GLMs Matter for Ecological Data

Ecological data are rarely normally distributed. Let’s visualize some common ecological data types:

Code

# Simulate realistic ecological data
set.seed(456)

# Species presence/absence
presence <- rbinom(100, 1, 0.6)

# Count data (number of individuals)
counts <- rpois(100, 3)

# Proportion data (percent cover)
proportions <- rbeta(100, 2, 2)

# Create comparison plot
eco_data <- data.frame(
  Presence = presence[1:50],
  Counts = counts[1:50],
  Proportions = proportions[1:50]
)

# Reshape for plotting
eco_long <- tidyr::pivot_longer(eco_data, 
                                cols = everything(),
                                names_to = "DataType",
                                values_to = "Value")

ggplot(eco_long, aes(x = Value)) +
  geom_histogram(bins = 15, fill = "forestgreen", color = "black") +
  facet_wrap(~DataType, scales = "free") +
  labs(
    title = "Common Types of Ecological Data",
    subtitle = "None of these are normally distributed!",
    x = "Value",
    y = "Frequency"
  )

Why GLMs are essential for ecology

Match the data structure:
- Binary data → binomial GLM
- Count data → Poisson/negative binomial GLM
- Proportions → beta regression GLM
Respect constraints:
- Probabilities stay between 0 and 1
- Counts stay non-negative
- Appropriate variance structures
Avoid invalid predictions:
- No negative counts
- No probabilities > 1
- Biologically meaningful scales
Better statistical properties:
- Proper hypothesis tests
- Accurate confidence intervals
- Correct standard errors

Using linear regression for non-normal data gives wrong answers!

2.9 Looking Ahead: The GLM Family

In this course, we’ll explore the GLM family systematically:

  Chapter                Model                Data_Type
1       3  Logistic Regression             Binary (0/1)
2       4   Poisson Regression                   Counts
3       5    Negative Binomial     Overdispersed Counts
4       6 Zero-Inflated Models Counts with Excess Zeros
5       7      Beta Regression        Proportions (0-1)
              Distribution        Link
1                 Binomial       Logit
2                  Poisson         Log
3        Negative Binomial         Log
4 Zero-Inflated Poisson/NB Logit + Log
5                     Beta       Logit

Each chapter will build on this GLM framework, adding new distributions and link functions as needed for different data types.

2.10 Key Takeaways

GLMs extend linear regression to handle non-normal data
Three components: random (distribution), systematic (linear predictor), link function
Link functions connect linear predictors to response scales and constrain predictions
glm() function in R implements the framework
Family argument specifies distribution and link
Maximum likelihood estimation instead of least squares
Deviance replaces sums of squares for measuring fit
Always explore your data first - understand structure, patterns, and potential issues
Essential for ecology because real data are rarely normal

2.11 Additional Resources

Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models (4th ed.). CRC Press.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC. [The classic text]
Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (2nd ed.). CRC Press.
Dunn, P. K., & Smyth, G. K. (2018). Generalized Linear Models with Examples in R. Springer.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer. [Chapter 2 on GLMs]

Now that we understand the GLM framework, we’re ready to apply it to binary data with logistic regression in Chapter 3!