library(tidyverse)
library(gridExtra)Chapter 2: Introduction to Generalized Linear Models
2.1 Beyond Linear Regression: The Need for Generalized Linear Models
In introductory statistics, we learned about linear regression - a powerful tool for modeling continuous response variables. Linear regression assumes:
- The response variable is continuous and can take any value on the real number line
- The relationship between predictors and response is linear
- Errors are normally distributed
- Variance is constant across all levels of predictors (homoscedasticity)
But ecological data often violate these assumptions:
- Binary outcomes: presence/absence, alive/dead, success/failure
- Count data: number of individuals, number of offspring, number of species
- Proportions: percent cover, survival rates, germination rates
- Rates: events per unit time or area
None of these fit the assumptions of normal linear regression!
Generalized Linear Models (GLMs) extend linear regression to handle these different data types while maintaining a unified framework.
2.2 The Three Components of a GLM
Every GLM consists of three key components:
2.2.1 The Random Component: Distribution of Y
This specifies the probability distribution of the response variable. Different distributions are appropriate for different data types:
# Create example distributions
set.seed(123)
x_vals <- 0:20
# Normal distribution (linear regression)
normal_data <- data.frame(
x = seq(-10, 10, length.out = 100),
density = dnorm(seq(-10, 10, length.out = 100), 0, 2),
Distribution = "Normal\n(Continuous data)"
)
# Binomial distribution (binary/proportion data)
binomial_data <- data.frame(
x = 0:10,
density = dbinom(0:10, size = 10, prob = 0.3),
Distribution = "Binomial\n(Binary/Count data)"
)
# Poisson distribution (count data)
poisson_data <- data.frame(
x = x_vals,
density = dpois(x_vals, lambda = 5),
Distribution = "Poisson\n(Count data)"
)
# Gamma distribution (continuous positive data)
gamma_data <- data.frame(
x = seq(0.1, 20, length.out = 100),
density = dgamma(seq(0.1, 20, length.out = 100), shape = 2, rate = 0.5),
Distribution = "Gamma\n(Positive continuous)"
)
# Plot
p1 <- ggplot(normal_data, aes(x = x, y = density)) +
geom_line(color = "blue", size = 1.2) +
labs(title = "Normal Distribution", x = "Value", y = "Density") +
theme_minimal()Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
p2 <- ggplot(binomial_data, aes(x = x, y = density)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(title = "Binomial Distribution", x = "Number of Successes", y = "Probability") +
theme_minimal()
p3 <- ggplot(poisson_data, aes(x = x, y = density)) +
geom_bar(stat = "identity", fill = "red") +
labs(title = "Poisson Distribution", x = "Count", y = "Probability") +
theme_minimal()
p4 <- ggplot(gamma_data, aes(x = x, y = density)) +
geom_line(color = "purple", size = 1.2) +
labs(title = "Gamma Distribution", x = "Value", y = "Density") +
theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol = 2)The distribution should match your data type:
- Normal: Continuous data with no bounds (temperature, height, weight)
- Binomial: Binary outcomes or proportions with known denominators
- Poisson: Counts of events (number of individuals, number of occurrences)
- Gamma: Positive continuous data (response times, waiting times)
- Negative Binomial: Overdispersed count data (coming in Chapter 5)
- Beta: Proportions bounded between 0 and 1 (coming in Chapter 7)
2.2.2 The Systematic Component: Linear Predictor
This is the familiar linear combination of predictors:
\[\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p\]
Where: - \(\eta\) (eta) is called the linear predictor - \(\beta_0, \beta_1, \ldots, \beta_p\) are coefficients we estimate - \(X_1, X_2, \ldots, X_p\) are our predictor variables
This part is exactly like linear regression! We’re still building a linear combination of our predictors.
2.2.3 The Link Function: Connecting Linear Predictor to Response
The link function \(g()\) connects the linear predictor to the expected value of the response:
\[g(E(Y)) = \eta = \beta_0 + \beta_1 X_1 + \ldots\]
Or equivalently:
\[E(Y) = g^{-1}(\eta)\]
Different link functions are used with different distributions:
# Create data for visualization
linear_pred <- seq(-5, 5, length.out = 100)
link_data <- data.frame(
linear_pred = linear_pred,
identity = linear_pred, # For normal (no transformation)
logit = plogis(linear_pred), # For binomial
log = exp(linear_pred), # For Poisson
inverse = 1/linear_pred # For gamma (only positive values)
)
# Plot identity link
p1 <- ggplot(link_data, aes(x = linear_pred, y = identity)) +
geom_line(color = "blue", size = 1.2) +
labs(title = "Identity Link\n(Normal distribution)",
x = "Linear Predictor (η)",
y = "E(Y)") +
theme_minimal()
# Plot logit link
p2 <- ggplot(link_data, aes(x = linear_pred, y = logit)) +
geom_line(color = "darkgreen", size = 1.2) +
geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "red") +
labs(title = "Logit Link\n(Binomial distribution)",
x = "Linear Predictor (η)",
y = "E(Y) = Probability") +
ylim(-0.1, 1.1) +
theme_minimal()
# Plot log link
p3 <- ggplot(link_data, aes(x = linear_pred, y = log)) +
geom_line(color = "red", size = 1.2) +
coord_cartesian(ylim = c(0, 20)) +
labs(title = "Log Link\n(Poisson distribution)",
x = "Linear Predictor (η)",
y = "E(Y) = Count") +
theme_minimal()
grid.arrange(p1, p2, p3, ncol = 3)Link functions serve two crucial purposes:
- Constrain predictions to valid ranges:
- Logit link keeps probabilities between 0 and 1
- Log link ensures counts are positive
- This prevents impossible predictions!
- Linearize non-linear relationships:
- Many biological relationships are non-linear on the original scale
- But linear on the transformed scale
- Makes estimation and interpretation more tractable
2.3 The GLM Framework in R
In R, we fit GLMs using the glm() function, which has a very similar syntax to lm():
# General syntax
model <- glm(response ~ predictor1 + predictor2 + ...,
family = distribution_family(link = "link_function"),
data = your_data)The key difference from lm() is the family argument, which specifies both the distribution and link function.
2.3.1 Common Family Specifications
# For binary data (presence/absence)
glm(presence ~ temperature, family = binomial(link = "logit"), data = df)
# For count data
glm(count ~ habitat, family = poisson(link = "log"), data = df)
# For continuous data (same as lm)
glm(weight ~ age, family = gaussian(link = "identity"), data = df)Each distribution family has a default (“canonical”) link function:
family = binomial→ logit link (most common)family = poisson→ log linkfamily = gaussian→ identity linkfamily = Gamma→ inverse link
You can usually omit the link specification and use the default:
glm(presence ~ temperature, family = binomial, data = df) # Assumes logit link2.4 A Concrete Example: Seedling Growth Data
Let’s work through a complete example using real data to understand how GLMs work in practice. We’ll use a simple dataset to illustrate that linear models are just a special case of GLMs.
2.4.1 The Dataset
We’ll analyze data from a tree seedling growth experiment. Researchers measured the height of Douglas fir seedlings at different ages to understand growth patterns. This is a classic continuous response variable that should follow a normal distribution - perfect for demonstrating the GLM framework.
Dataset: seedling_growth.csv
Variables: - age_weeks: Age of seedling in weeks (predictor) - height_cm: Height of seedling in centimeters (response) - light_level: Amount of light exposure (low, medium, high)
Study context: Seedlings were grown in a greenhouse under controlled conditions with varying light levels. Measurements were taken weekly for 10 weeks.
2.4.2 Loading and Exploring the Data
Always start with data exploration! This helps you understand your data structure, identify potential issues, and inform your modeling decisions.
# Load the data
seedling_data <- read.csv("data/seedling_growth.csv")
# First look at the structure
str(seedling_data)'data.frame': 100 obs. of 3 variables:
$ age_weeks : int 1 1 1 1 1 1 1 1 1 1 ...
$ height_cm : num 2.05 3.32 5.75 2.56 3.6 ...
$ light_level: chr "low" "medium" "high" "low" ...
# Summary statistics
summary(seedling_data) age_weeks height_cm light_level
Min. : 1.0 Min. :2.052 Length:100
1st Qu.: 3.0 1st Qu.:4.440 Class :character
Median : 5.5 Median :5.919 Mode :character
Mean : 5.5 Mean :5.812
3rd Qu.: 8.0 3rd Qu.:7.148
Max. :10.0 Max. :9.419
# Check for missing values
cat("\nMissing values:\n")
Missing values:
colSums(is.na(seedling_data)) age_weeks height_cm light_level
0 0 0
# Sample size
cat("\nTotal observations:", nrow(seedling_data), "\n")
Total observations: 100
From our data exploration:
- We have 100 observations
- Age ranges from 1 to 10 weeks
- Height ranges from 2.1 to 9.4 cm
- Mean height is 5.8 cm (SD = 1.8)
- No missing values - data is complete
- Light level is a factor with three levels
This appears to be clean, well-structured data suitable for analysis.
2.4.3 Visual Data Exploration
Let’s visualize the relationships in our data:
# Plot 1: Histogram of height
p1 <- ggplot(seedling_data, aes(x = height_cm)) +
geom_histogram(bins = 20, fill = "forestgreen", color = "black") +
labs(title = "Distribution of Seedling Height",
x = "Height (cm)",
y = "Frequency") +
theme_minimal()
# Plot 2: Height vs Age
p2 <- ggplot(seedling_data, aes(x = age_weeks, y = height_cm)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Seedling Height vs. Age",
x = "Age (weeks)",
y = "Height (cm)") +
theme_minimal()
# Plot 3: Height by light level
p3 <- ggplot(seedling_data, aes(x = light_level, y = height_cm, fill = light_level)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(title = "Height by Light Level",
x = "Light Level",
y = "Height (cm)") +
theme_minimal() +
theme(legend.position = "none")
# Plot 4: Height vs Age by light level
p4 <- ggplot(seedling_data, aes(x = age_weeks, y = height_cm, color = light_level)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_viridis_d() +
labs(title = "Growth Trajectories by Light Level",
x = "Age (weeks)",
y = "Height (cm)",
color = "Light Level") +
theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol = 2)`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Distribution: Height appears roughly normally distributed (good for gaussian GLM!)
Age relationship: Clear positive linear relationship between age and height
- Seedlings grow steadily over time
- Relationship appears linear (no obvious curvature)
Light effect: Seedlings under high light tend to be taller
- Suggests light level is an important predictor
- Some overlap between groups
Interaction?: Growth trajectories (slopes) look similar across light levels
- Suggests light level affects starting height more than growth rate
- An additive model may be appropriate
These insights guide our modeling approach!
2.5 Fitting a GLM to the Seedling Data
Now let’s fit models to understand seedling growth. We’ll start with a simple model and build up.
2.5.1 Simple Linear Model as a GLM
# Fit using lm() - the traditional approach
lm_model <- lm(height_cm ~ age_weeks, data = seedling_data)
# Fit using glm() with gaussian family
glm_model <- glm(height_cm ~ age_weeks, family = gaussian, data = seedling_data)
# Compare results
cat("=== Linear Model (lm) Coefficients ===\n")=== Linear Model (lm) Coefficients ===
print(summary(lm_model)$coefficients) Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9473846 0.22828513 12.91098 7.224042e-23
age_weeks 0.5208982 0.03679148 14.15812 1.953232e-25
cat("\n=== GLM with Gaussian Family Coefficients ===\n")
=== GLM with Gaussian Family Coefficients ===
print(summary(glm_model)$coefficients) Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9473846 0.22828513 12.91098 7.224042e-23
age_weeks 0.5208982 0.03679148 14.15812 1.953232e-25
Notice that lm() and glm() with family = gaussian give identical results: - Same coefficient estimates - Same standard errors - Same t-values and p-values
This demonstrates that linear regression is just a special case of the GLM framework: - Uses the normal (gaussian) distribution - With the identity link function (no transformation) - Estimated using maximum likelihood (equivalent to least squares for normal data)
So lm() is really just a convenient shortcut for glm(y ~ x, family = gaussian)!
2.5.2 Interpreting the Simple Model
# Extract coefficients
intercept <- coef(glm_model)[1]
slope <- coef(glm_model)[2]
cat("Interpretation of coefficients:\n")Interpretation of coefficients:
cat("Intercept =", round(intercept, 2), "cm\n")Intercept = 2.95 cm
cat(" → Expected height at week 0 (germination)\n\n") → Expected height at week 0 (germination)
cat("Slope =", round(slope, 2), "cm per week\n")Slope = 0.52 cm per week
cat(" → Seedlings grow approximately", round(slope, 2), "cm each week\n") → Seedlings grow approximately 0.52 cm each week
# Visualize the fit
ggplot(seedling_data, aes(x = age_weeks, y = height_cm)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Seedling Growth Model",
subtitle = paste0("Height = ", round(intercept, 1), " + ",
round(slope, 2), " × Age"),
x = "Age (weeks)",
y = "Height (cm)") +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
2.5.3 Adding Light Level to the Model
# Fit a model with both age and light level
full_glm <- glm(height_cm ~ age_weeks + light_level,
family = gaussian,
data = seedling_data)
summary(full_glm)
Call:
glm(formula = height_cm ~ age_weeks + light_level, family = gaussian,
data = seedling_data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.93829 0.19078 20.643 < 2e-16 ***
age_weeks 0.51897 0.02555 20.310 < 2e-16 ***
light_levellow -1.85147 0.17933 -10.324 < 2e-16 ***
light_levelmedium -1.06295 0.18068 -5.883 5.87e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.5385502)
Null deviance: 333.291 on 99 degrees of freedom
Residual deviance: 51.701 on 96 degrees of freedom
AIC: 227.82
Number of Fisher Scoring iterations: 2
With both predictors included:
Age effect (slope): 0.519 cm/week - Seedlings grow about 0.52 cm per week - This is the growth rate, holding light level constant
Light level effects (compared to baseline - low light): - Medium light: -1.06 cm taller on average - High light: NA cm taller on average
All predictors are statistically significant (p < 0.001), confirming that both age and light affect seedling height.
2.6 Maximum Likelihood Estimation
GLMs are fit using maximum likelihood estimation (MLE) rather than least squares.
2.6.1 What is Maximum Likelihood?
The idea is simple: we choose parameter values that make our observed data most likely to have occurred.
# Simple example: estimating the probability of success
# Suppose we observe 7 successes out of 10 trials
# Try different probability values
prob_values <- seq(0.01, 0.99, by = 0.01)
# Calculate likelihood for each probability
# Likelihood = P(data | parameter)
likelihood <- dbinom(7, size = 10, prob = prob_values)
likelihood_df <- data.frame(
probability = prob_values,
likelihood = likelihood
)
# Find maximum
mle_estimate <- prob_values[which.max(likelihood)]
# Plot
ggplot(likelihood_df, aes(x = probability, y = likelihood)) +
geom_line(color = "blue", size = 1.2) +
geom_vline(xintercept = mle_estimate, linetype = "dashed", color = "red", size = 1) +
geom_point(x = mle_estimate, y = max(likelihood), color = "red", size = 3) +
annotate("text", x = mle_estimate + 0.15, y = max(likelihood),
label = paste("MLE =", mle_estimate), color = "red", size = 5) +
labs(
title = "Maximum Likelihood Estimation",
subtitle = "Finding the parameter value that maximizes the likelihood of observing our data",
x = "Probability Parameter",
y = "Likelihood"
) +
theme_minimal()Least Squares (used in lm()): - Minimizes the sum of squared residuals - Works well for normally distributed data - Closed-form solution (direct calculation)
Maximum Likelihood (used in glm()): - Maximizes the probability of observing the data - Works for any distribution - Usually requires iterative numerical optimization - More general and flexible
Key insight: For normal data with identity link, MLE and least squares give the same answer! This is why our lm() and glm() results were identical.
2.7 Model Fit and Deviance
Instead of R-squared and residual sums of squares, GLMs use deviance to measure model fit.
Deviance is analogous to the residual sum of squares, but for GLMs:
\[D = -2 \times [\log(\text{likelihood of fitted model}) - \log(\text{likelihood of saturated model})]\]
# Extract deviance information from our model
cat("Null Deviance:", full_glm$null.deviance, "\n")Null Deviance: 333.2909
cat(" (model with only intercept)\n\n") (model with only intercept)
cat("Residual Deviance:", full_glm$deviance, "\n")Residual Deviance: 51.70082
cat(" (our fitted model)\n\n") (our fitted model)
cat("Reduction in Deviance:", full_glm$null.deviance - full_glm$deviance, "\n")Reduction in Deviance: 281.5901
cat(" (improvement from adding predictors)\n\n") (improvement from adding predictors)
# Calculate pseudo R-squared
pseudo_r2 <- 1 - (full_glm$deviance / full_glm$null.deviance)
cat("Pseudo R-squared:", round(pseudo_r2, 4), "\n")Pseudo R-squared: 0.8449
cat(" (proportion of deviance explained)\n") (proportion of deviance explained)
- Lower deviance = better fit
- Null deviance: Deviance of a model with only an intercept
- Residual deviance: Deviance of your fitted model
- Reduction in deviance: How much better your model is than the null
The reduction in deviance follows a chi-squared distribution, which we use for hypothesis tests!
For our seedling model: - We reduced deviance by 281.6 - This represents a 84.5% reduction - Our predictors substantially improve the model
2.8 Why GLMs Matter for Ecological Data
Ecological data are rarely normally distributed. Let’s visualize some common ecological data types:
# Simulate realistic ecological data
set.seed(456)
# Species presence/absence
presence <- rbinom(100, 1, 0.6)
# Count data (number of individuals)
counts <- rpois(100, 3)
# Proportion data (percent cover)
proportions <- rbeta(100, 2, 2)
# Create comparison plot
eco_data <- data.frame(
Presence = presence[1:50],
Counts = counts[1:50],
Proportions = proportions[1:50]
)
# Reshape for plotting
eco_long <- tidyr::pivot_longer(eco_data,
cols = everything(),
names_to = "DataType",
values_to = "Value")
ggplot(eco_long, aes(x = Value)) +
geom_histogram(bins = 15, fill = "forestgreen", color = "black") +
facet_wrap(~DataType, scales = "free") +
labs(
title = "Common Types of Ecological Data",
subtitle = "None of these are normally distributed!",
x = "Value",
y = "Frequency"
) +
theme_minimal()- Match the data structure:
- Binary data → binomial GLM
- Count data → Poisson/negative binomial GLM
- Proportions → beta regression GLM
- Respect constraints:
- Probabilities stay between 0 and 1
- Counts stay non-negative
- Appropriate variance structures
- Avoid invalid predictions:
- No negative counts
- No probabilities > 1
- Biologically meaningful scales
- Better statistical properties:
- Proper hypothesis tests
- Accurate confidence intervals
- Correct standard errors
Using linear regression for non-normal data gives wrong answers!
2.9 Looking Ahead: The GLM Family
In this course, we’ll explore the GLM family systematically:
Chapter Model Data_Type
1 3 Logistic Regression Binary (0/1)
2 4 Poisson Regression Counts
3 5 Negative Binomial Overdispersed Counts
4 6 Zero-Inflated Models Counts with Excess Zeros
5 7 Beta Regression Proportions (0-1)
Distribution Link
1 Binomial Logit
2 Poisson Log
3 Negative Binomial Log
4 Zero-Inflated Poisson/NB Logit + Log
5 Beta Logit
Each chapter will build on this GLM framework, adding new distributions and link functions as needed for different data types.
2.10 Key Takeaways
- GLMs extend linear regression to handle non-normal data
- Three components: random (distribution), systematic (linear predictor), link function
- Link functions connect linear predictors to response scales and constrain predictions
glm()function in R implements the framework- Family argument specifies distribution and link
- Maximum likelihood estimation instead of least squares
- Deviance replaces sums of squares for measuring fit
- Always explore your data first - understand structure, patterns, and potential issues
- Essential for ecology because real data are rarely normal
2.11 Additional Resources
Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models (4th ed.). CRC Press.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC. [The classic text]
Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (2nd ed.). CRC Press.
Dunn, P. K., & Smyth, G. K. (2018). Generalized Linear Models with Examples in R. Springer.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer. [Chapter 2 on GLMs]
Now that we understand the GLM framework, we’re ready to apply it to binary data with logistic regression in Chapter 3!