library(tidyverse)
library(MASS) # for additional GLM functions
library(performance) # for model comparison
library(car) # for diagnostics
Chapter 4: Poisson Regression for Count Data
4.1 From Probabilities to Counts: When Your Response is a Whole Number
In Chapter 3, we learned how to model binary data (presence/absence) using logistic regression with the binomial distribution. But what if we’re not just interested in whether something is present - what if we want to know how many are present?
Count data are everywhere in ecology:
- Number of individuals observed in a survey plot
- Number of offspring produced by a parent
- Number of species in a community
- Number of flowers on a plant
- Number of territorial displays observed
- Number of seeds germinating from a seed bank
- Number of parasites on a host
Unlike continuous measurements (which can take any value) or binary outcomes (0 or 1), counts are:
- Non-negative integers: 0, 1, 2, 3, … (never negative, never fractional)
- Discrete: You can’t have 2.5 individuals
- Often right-skewed: Many zeros and small values, few large values
In Chapter 2, we learned that the GLM framework allows us to choose appropriate distributions for different data types. For count data, we use the Poisson distribution with a log link function.
Let’s see why linear regression fails for count data, just as it failed for binary data.
4.2 The Problem with Linear Regression for Count Data
4.2.1 The Dataset
We’ll analyze data from a bird point count survey in restored prairie sites. Researchers conducted standardized point counts to understand how grassland bird abundance responds to restoration age and habitat characteristics.
Dataset: prairie_birds.csv
Variables:
- bird_count: Number of grassland birds detected at the point (count)
- restoration_age: Years since restoration (0 = unrestored agriculture, up to 20 years)
- native_plant_cover: Percent cover of native plant species (0-100%)
- patch_size: Size of the prairie patch in hectares
Study context: Prairie restoration is a key conservation strategy in the Midwest. Understanding how bird abundance responds to restoration helps managers prioritize sites and predict recovery trajectories.
4.2.2 Loading and Exploring the Data
# Load the data
bird_data <- read.csv("data/prairie_birds.csv")
# Examine structure
str(bird_data)
'data.frame': 150 obs. of 4 variables:
$ bird_count : int 30 10 23 33 17 6 8 8 17 12 ...
$ restoration_age : num 5.8 15.8 8.2 17.7 18.8 0.9 10.6 17.8 11 9.1 ...
$ native_plant_cover: num 84.7 49.8 38.8 24.6 11.1 39 57.2 21.7 44.5 21.8 ...
$ patch_size : num 39.4 1.5 39.2 36.7 31.9 24.6 8.7 1.4 23.2 25.1 ...
# Summary statistics
summary(bird_data)
  bird_count    restoration_age native_plant_cover   patch_size
Min. : 0.00 Min. : 0.00 Min. : 0.60 Min. : 1.40
1st Qu.: 9.25 1st Qu.: 4.90 1st Qu.:26.50 1st Qu.:13.10
Median :17.00 Median : 9.55 Median :48.15 Median :24.10
Mean :21.41 Mean :10.09 Mean :49.50 Mean :25.09
3rd Qu.:27.00 3rd Qu.:15.10 3rd Qu.:72.08 3rd Qu.:36.85
Max. :78.00 Max. :19.90 Max. :98.60 Max. :50.00
# Check for missing values
cat("\nMissing values:\n")
Missing values:
colSums(is.na(bird_data))
 bird_count restoration_age native_plant_cover patch_size
0 0 0 0
# Sample size
cat("\nTotal observations:", nrow(bird_data), "\n")
Total observations: 150
# Distribution of counts
cat("Bird count distribution:\n")Bird count distribution:
table(bird_data$bird_count)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 1 1 2 6 4 7 6 6 3 4 1 9 5 6 4 6 4 3 4 4 7 6 2 3 3
27 30 31 33 34 35 36 37 38 40 41 42 43 44 46 49 50 55 59 60 62 66 70 78
4 3 1 4 1 1 1 2 2 1 1 1 3 1 1 2 2 2 1 1 3 1 1 1
cat("\nSummary statistics for bird counts:\n")
Summary statistics for bird counts:
cat("Mean:", round(mean(bird_data$bird_count), 2), "\n")Mean: 21.41
cat("Variance:", round(var(bird_data$bird_count), 2), "\n")Variance: 264.69
cat("Variance/Mean ratio:", round(var(bird_data$bird_count) / mean(bird_data$bird_count), 2), "\n")Variance/Mean ratio: 12.36
cat("Range:", range(bird_data$bird_count), "\n")Range: 0 78
From our exploration:
Sample characteristics:
- 150 point count surveys conducted
- Bird counts range from 0 to 78 individuals
- Mean count = 21.41, variance = 264.69
- Variance/mean ratio = 12.36 - well above 1, so the raw counts are overdispersed; much of this marginal overdispersion may be explained by the predictors, and we will check the model-based dispersion after fitting (Section 4.8.2)
- No missing data
Distribution:
- Most sites have moderate counts (roughly 0-25 birds)
- A smaller number of sites have very high counts (40 or more birds)
- Right-skewed distribution typical of count data
# Plot 1: Histogram of counts
p1 <- ggplot(bird_data, aes(x = bird_count)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
labs(title = "Distribution of Bird Counts",
x = "Number of Birds",
y = "Frequency") +
theme_minimal()
# Plot 2: Counts vs restoration age
p2 <- ggplot(bird_data, aes(x = restoration_age, y = bird_count)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, color = "darkgreen") +
labs(title = "Bird Counts vs. Restoration Age",
x = "Restoration Age (years)",
y = "Number of Birds") +
theme_minimal()
# Plot 3: Counts vs native plant cover
p3 <- ggplot(bird_data, aes(x = native_plant_cover, y = bird_count)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, color = "darkblue") +
labs(title = "Bird Counts vs. Native Plant Cover",
x = "Native Plant Cover (%)",
y = "Number of Birds") +
theme_minimal()
# Plot 4: Counts vs patch size
p4 <- ggplot(bird_data, aes(x = patch_size, y = bird_count)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, color = "purple") +
labs(title = "Bird Counts vs. Patch Size",
x = "Patch Size (ha)",
y = "Number of Birds") +
theme_minimal()
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
Distribution: Count data are right-skewed with many low values and few high values
Restoration age: Bird abundance appears to increase with restoration age, with steeper increases in early years
Native plant cover: Positive relationship - more native plants associated with more birds
Patch size: Larger patches tend to support more birds, though relationship may be non-linear
Zeros: A few sites have zero birds detected - this is ecologically meaningful (habitat not suitable yet)
These patterns suggest restoration and habitat quality drive bird abundance!
4.2.3 Why Linear Regression Fails for Counts
Let’s see what happens when we try to use linear regression:
# Fit linear regression (wrong!)
linear_model <- lm(bird_count ~ restoration_age, data = bird_data)
# Create predictions
pred_grid <- data.frame(restoration_age = seq(0, 20, length.out = 100))
pred_grid$linear_pred <- predict(linear_model, newdata = pred_grid)
# Plot
ggplot(bird_data, aes(x = restoration_age, y = bird_count)) +
geom_point(alpha = 0.5) +
geom_line(data = pred_grid, aes(y = linear_pred), color = "red", linewidth = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
labs(
title = "Linear Regression Applied to Count Data (WRONG!)",
subtitle = "Red line shows predictions can go negative (impossible for counts!)",
x = "Restoration Age (years)",
y = "Bird Count"
) +
theme_minimal()
par(mfrow = c(2, 2))
plot(linear_model)
par(mfrow = c(1, 1))
Predictions can be negative: The red line goes below zero at low restoration ages. You can’t have negative birds!
Wrong variance structure: Count data variance often increases with the mean. Linear regression assumes constant variance.
Wrong distribution: Residuals aren’t normally distributed - they’re skewed because counts are bounded at zero.
Heteroscedasticity: The diagnostic plots show clear fanning pattern - variance increases with fitted values.
These violations mean our p-values, confidence intervals, and predictions are all wrong!
4.3 The Poisson Distribution: Made for Count Data
The Poisson distribution is specifically designed for count data. It has one parameter, λ (lambda), which is both the mean AND the variance.
# Create Poisson distributions with different lambdas
lambda_values <- c(1, 3, 7, 12)
poisson_data <- expand.grid(
count = 0:25,
lambda = lambda_values
) %>%
mutate(
probability = dpois(count, lambda),
lambda_label = paste0("λ = ", lambda)
)
ggplot(poisson_data, aes(x = count, y = probability)) +
geom_bar(stat = "identity", fill = "steelblue", color = "black") +
facet_wrap(~lambda_label, scales = "free_y") +
labs(
title = "Poisson Distribution for Different Mean Values (λ)",
subtitle = "Notice how the distribution shifts right and spreads as λ increases",
x = "Count",
y = "Probability"
) +
theme_minimal()
For count data: Only non-negative integers (0, 1, 2, 3, …)
One parameter: λ (lambda) determines both mean and variance
- E(Y) = λ
- Var(Y) = λ
- This is called equidispersion (mean = variance)
Shape changes with λ:
- Small λ: Right-skewed, many zeros
- Large λ: More symmetric, approaches normal
Ecological interpretation: λ is the expected count (average number of events)
Important: The assumption that mean = variance is strict and often violated in ecological data. We’ll address this in Chapter 5 with negative binomial regression.
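To make equidispersion concrete, we can simulate a large Poisson sample and check that the sample mean and variance nearly coincide. This is a minimal sketch using simulated values only (none of these objects belong to the bird analysis):
# Simulate 10,000 Poisson counts with lambda = 7
set.seed(42)
sim_counts <- rpois(10000, lambda = 7)
mean(sim_counts)                     # close to 7
var(sim_counts)                      # also close to 7 (equidispersion)
var(sim_counts) / mean(sim_counts)   # ratio close to 1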
4.4 Poisson Regression: The GLM Approach
Poisson regression solves the problems with linear regression by using:
1. The Poisson distribution (appropriate for counts)
2. The log link function (ensures predictions are positive)
3. A model for the expected count as a function of predictors
The Poisson regression model is:
\[ \begin{align} Y_i &\sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots \end{align} \]
Where:
- \(Y_i\) is the observed count for observation \(i\)
- \(\lambda_i\) is the expected count (Poisson parameter)
- The log link ensures \(\lambda_i\) is always positive
4.4.1 The Log Link Function
# Visualize the log link
linear_pred <- seq(-3, 3, length.out = 100)
expected_count <- exp(linear_pred) # Inverse of log link
link_df <- data.frame(
linear_predictor = linear_pred,
expected_count = expected_count
)
ggplot(link_df, aes(x = linear_predictor, y = expected_count)) +
geom_line(color = "darkgreen", size = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Log Link Function: Linear Predictor to Expected Count",
subtitle = "Exponential transformation ensures counts are always positive",
x = "Linear Predictor (β₀ + β₁X₁ + ...)",
y = "Expected Count (λ)"
) +
theme_minimal()
On the link scale (log):
- \(\log(\lambda) = \beta_0 + \beta_1 X_1 + \ldots\)
- Coefficients (β) represent changes in the log of the expected count
- The model is additive on the log scale
On the response scale (count):
- \(\lambda = e^{\beta_0 + \beta_1 X_1 + \ldots}\)
- Exponentiating gives us the expected count
- The model is multiplicative on the count scale
Key insight: A one-unit increase in X₁ multiplies the expected count by \(e^{\beta_1}\)
This is why we interpret Poisson coefficients as rate ratios after exponentiating!
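A quick numeric check makes the multiplicative interpretation concrete. The numbers below are purely illustrative (β₀ = 1, β₁ = 0.08), not output from a fitted model:
# Rate-ratio interpretation: a one-unit increase in X multiplies the expected count by exp(beta1)
beta0 <- 1.0    # illustrative intercept (log scale)
beta1 <- 0.08   # illustrative slope (log scale)
lambda_at_5 <- exp(beta0 + beta1 * 5)   # expected count at X = 5
lambda_at_6 <- exp(beta0 + beta1 * 6)   # expected count at X = 6
lambda_at_6 / lambda_at_5               # equals exp(beta1), about 1.083
exp(beta1)                              # the same rate ratio, straight from the coefficient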
4.5 Fitting Poisson Regression in R
Let’s fit a Poisson regression to our bird count data:
# Fit Poisson regression model
bird_poisson <- glm(bird_count ~ restoration_age + native_plant_cover + patch_size,
family = poisson(link = "log"),
data = bird_data)
# Summary
summary(bird_poisson)
Call:
glm(formula = bird_count ~ restoration_age + native_plant_cover +
patch_size, family = poisson(link = "log"), data = bird_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5646230 0.0759830 7.431 1.08e-13 ***
restoration_age 0.0760829 0.0032618 23.325 < 2e-16 ***
native_plant_cover 0.0149407 0.0006847 21.822 < 2e-16 ***
patch_size 0.0293899 0.0013371 21.981 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1701.62 on 149 degrees of freedom
Residual deviance: 181.39 on 146 degrees of freedom
AIC: 876.66
Number of Fisher Scoring iterations: 4
Let’s break down the key components:
Deviance residuals (not shown in the summary output above, but available via residuals(bird_poisson, type = "deviance")):
- Should be roughly symmetric around zero
- Large values indicate poorly fitted observations
Coefficients (on the log scale):
- restoration_age: β = 0.0761
- native_plant_cover: β = 0.0149
- patch_size: β = 0.0294
All coefficients are positive and significant, suggesting all three factors increase bird abundance.
Deviance:
- Null deviance: 1701.6 (model with intercept only)
- Residual deviance: 181.4 (our fitted model)
- Reduction: 1520.2 - a substantial improvement
AIC: 876.7 - useful for model comparison
But what do these coefficients actually mean? Let’s explore interpretation next.
4.6 Interpreting Poisson Regression Coefficients
The coefficients are on the log scale, which makes direct interpretation challenging. We need to exponentiate them to get rate ratios (also called incidence rate ratios or IRRs).
# Extract coefficients
coefs <- coef(bird_poisson)
# Calculate rate ratios (exponentiate coefficients)
rate_ratios <- exp(coefs)
# Create interpretation table
rr_table <- data.frame(
Variable = names(coefs),
Coefficient = round(coefs, 4),
Rate_Ratio = round(rate_ratios, 4),
Percent_Change = round((rate_ratios - 1) * 100, 2)
)
print(rr_table)
 Variable Coefficient Rate_Ratio Percent_Change
(Intercept) (Intercept) 0.5646 1.7588 75.88
restoration_age restoration_age 0.0761 1.0791 7.91
native_plant_cover native_plant_cover 0.0149 1.0151 1.51
patch_size patch_size 0.0294 1.0298 2.98
restoration_age (RR = 1.079):
- For each additional year of restoration, the expected bird count is multiplied by 1.079
- This is a 7.9% increase per year
- Over 10 years the count is multiplied by exp(10 × 0.0761) ≈ 2.14
native_plant_cover (RR = 1.015):
- Each 1% increase in native cover multiplies the expected bird count by 1.015
- This is a 1.51% increase per 1% cover
- A 10-point increase in cover multiplies the count by exp(10 × 0.0149) ≈ 1.16
patch_size (RR = 1.030):
- Each additional hectare multiplies the expected bird count by 1.030
- This is a 2.98% increase per hectare
Key insight: Rate ratios > 1 indicate positive effects (count increases), rate ratios < 1 indicate negative effects (count decreases), and a rate ratio = 1 means no effect.
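Because effects are multiplicative, a k-unit change in a predictor corresponds to exp(k × β), i.e. the rate ratio raised to the k-th power. A quick check of the 10-year and 10-point-cover figures quoted above, using the fitted coefficients:
# Cumulative multiplicative effects over larger changes in a predictor
exp(10 * coef(bird_poisson)["restoration_age"])      # ~2.14: ten years of restoration
exp(10 * coef(bird_poisson)["native_plant_cover"])   # ~1.16: ten percentage points more native cover
exp(coef(bird_poisson)["restoration_age"])^10        # identical to the first line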
4.6.1 Confidence Intervals for Rate Ratios
# Get confidence intervals
conf_ints <- confint(bird_poisson)
Waiting for profiling to be done...
# Exponentiate for rate ratio scale
rr_ci_table <- exp(cbind(
Rate_Ratio = coef(bird_poisson),
conf_ints
))
print(round(rr_ci_table, 3))
 Rate_Ratio 2.5 % 97.5 %
(Intercept) 1.759 1.514 2.039
restoration_age 1.079 1.072 1.086
native_plant_cover 1.015 1.014 1.016
patch_size 1.030 1.027 1.033
The 95% confidence intervals tell us the range of plausible values for the rate ratios.
For restoration_age:
- 95% CI: [1.072, 1.086]
- We’re 95% confident the true rate ratio lies in this range
- Since the CI doesn’t include 1, the effect is statistically significant
- All values > 1, confirming a positive effect
Similar interpretation for the other predictors - all CIs exclude 1, confirming significant positive effects.
4.7 Visualizing Poisson Regression Predictions
While rate ratios tell us about multiplicative effects, visualizing predictions on the count scale is often more intuitive:
# Create prediction grids for each variable
grid_age <- data.frame(
restoration_age = seq(0, 20, length.out = 100),
native_plant_cover = mean(bird_data$native_plant_cover),
patch_size = mean(bird_data$patch_size)
)
grid_cover <- data.frame(
restoration_age = mean(bird_data$restoration_age),
native_plant_cover = seq(0, 100, length.out = 100),
patch_size = mean(bird_data$patch_size)
)
grid_size <- data.frame(
restoration_age = mean(bird_data$restoration_age),
native_plant_cover = mean(bird_data$native_plant_cover),
patch_size = seq(min(bird_data$patch_size), max(bird_data$patch_size), length.out = 100)
)
# Get predictions with confidence intervals
get_poisson_predictions <- function(model, newdata) {
pred <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
newdata$count <- exp(pred$fit)
newdata$lower <- exp(pred$fit - 1.96 * pred$se.fit)
newdata$upper <- exp(pred$fit + 1.96 * pred$se.fit)
return(newdata)
}
grid_age <- get_poisson_predictions(bird_poisson, grid_age)
grid_cover <- get_poisson_predictions(bird_poisson, grid_cover)
grid_size <- get_poisson_predictions(bird_poisson, grid_size)
# Create plots
p1 <- ggplot(grid_age, aes(x = restoration_age, y = count)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "steelblue") +
geom_line(color = "steelblue", size = 1.2) +
labs(
title = "Effect of Restoration Age",
x = "Restoration Age (years)",
y = "Expected Bird Count"
) +
theme_minimal()
p2 <- ggplot(grid_cover, aes(x = native_plant_cover, y = count)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "darkgreen") +
geom_line(color = "darkgreen", size = 1.2) +
labs(
title = "Effect of Native Plant Cover",
x = "Native Plant Cover (%)",
y = "Expected Bird Count"
) +
theme_minimal()
p3 <- ggplot(grid_size, aes(x = patch_size, y = count)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "purple") +
geom_line(color = "purple", size = 1.2) +
labs(
title = "Effect of Patch Size",
x = "Patch Size (ha)",
y = "Expected Bird Count"
) +
theme_minimal()
gridExtra::grid.arrange(p1, p2, p3, ncol = 2)
Restoration age effect:
- Expected bird count increases exponentially with restoration age
- New restorations (0-5 years) have comparatively low predicted abundance
- Mature restorations (15-20 years) are predicted to support several times as many birds
- Management implication: Be patient - restoration takes time!
Native plant cover effect:
- Strong positive relationship throughout the range
- Sites with low native cover (< 20%) are predicted to support far fewer birds than sites with > 80% cover
- Management implication: Prioritize increasing native plant cover
Patch size effect:
- Larger patches support more birds, as expected from species-area relationships
- Because of the log link the effect is multiplicative: each additional hectare multiplies the expected count by about 1.03
- Small patches support relatively few birds
- Management implication: Protect and restore large contiguous patches when possible
All three factors work together - the best sites for grassland birds are large, old restorations with high native plant cover!
4.8 Model Diagnostics for Poisson Regression
4.8.1 Deviance and Model Fit
# Extract deviance information
null_dev <- bird_poisson$null.deviance
resid_dev <- bird_poisson$deviance
df_null <- bird_poisson$df.null
df_resid <- bird_poisson$df.residual
# Calculate pseudo R-squared
pseudo_r2 <- 1 - (resid_dev / null_dev)
cat("Null Deviance:", round(null_dev, 2), "on", df_null, "df\n")Null Deviance: 1701.62 on 149 df
cat("Residual Deviance:", round(resid_dev, 2), "on", df_resid, "df\n")Residual Deviance: 181.39 on 146 df
cat("Reduction:", round(null_dev - resid_dev, 2), "\n")Reduction: 1520.23
cat("Pseudo R²:", round(pseudo_r2, 4), "\n\n")Pseudo R²: 0.8934
# Likelihood ratio test
lr_stat <- null_dev - resid_dev
lr_df <- df_null - df_resid
lr_p <- pchisq(lr_stat, lr_df, lower.tail = FALSE)
cat("Likelihood Ratio Test:\n")Likelihood Ratio Test:
cat("Chi-squared =", round(lr_stat, 2), "on", lr_df, "df\n")Chi-squared = 1520.23 on 3 df
cat("p-value < 0.001\n")p-value < 0.001
Deviance measures how well the model fits the data:
- Lower deviance = better fit
- The reduction from null to residual deviance shows the improvement from adding predictors
- Our model reduces deviance by 89.3%
Pseudo R²: 0.893 indicates substantial explanatory power
Likelihood ratio test: Highly significant (p < 0.001), confirming the predictors meaningfully improve the model
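The performance package loaded at the start of the chapter provides convenience functions for similar fit summaries. A minimal sketch, assuming a reasonably recent version of performance is installed (its pseudo R² for a Poisson GLM is computed differently from the deviance-based value above, so the numbers need not match exactly):
# Model fit summaries via the performance package
performance::r2(bird_poisson)                  # pseudo R-squared
performance::model_performance(bird_poisson)   # AIC, BIC, RMSE, and related indices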
4.8.2 Checking for Overdispersion
A critical assumption of Poisson regression is that mean = variance. Let’s check:
# Calculate dispersion parameter
dispersion <- resid_dev / df_resid
cat("Dispersion parameter:", round(dispersion, 3), "\n")Dispersion parameter: 1.242
cat("\nInterpretation:\n")
Interpretation:
if (dispersion < 1) {
cat(" Dispersion < 1: Possible underdispersion\n")
} else if (dispersion <= 1.5) {
cat(" Dispersion ≈ 1: Poisson assumption reasonable\n")
} else if (dispersion <= 3) {
cat(" Dispersion > 1: Mild overdispersion\n")
} else {
cat(" Dispersion >> 1: Substantial overdispersion - consider negative binomial!\n")
}
 Dispersion ≈ 1: Poisson assumption reasonable
# Formal test
library(AER) # for the dispersiontest() function
disptest_result <- dispersiontest(bird_poisson)
print(disptest_result)
Overdispersion test
data: bird_poisson
z = 1.193, p-value = 0.1164
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.148533
Overdispersion occurs when variance > mean (violating the Poisson assumption that mean = variance).
Signs of overdispersion:
- Dispersion parameter > 1 (residual deviance / df > 1)
- Significant dispersion test
- Residual deviance much larger than the degrees of freedom
Consequences if ignored:
- Standard errors are too small (underestimated uncertainty)
- P-values are too small (false positives)
- Confidence intervals are too narrow
- You might claim significance when there isn’t any!
Common causes in ecology:
- Individual heterogeneity
- Spatial clustering (contagion)
- Missing important predictors
- Temporal correlation
What to do:
- If dispersion ≈ 1-1.5: Poisson is okay
- If dispersion ≈ 1.5-2: Consider quasi-Poisson, which adjusts the standard errors (see the sketch below)
- If dispersion > 2-3: Use negative binomial regression (Chapter 5)
Our model: Dispersion = 1.24. Poisson seems appropriate!
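Had the dispersion check pointed to overdispersion, a quick remedy is a quasi-Poisson fit, which keeps the same coefficient estimates but scales the standard errors by the estimated dispersion. A minimal sketch for our model (shown for illustration only, since a dispersion of 1.24 does not require it):
# Quasi-Poisson refit: identical point estimates, standard errors inflated by sqrt(dispersion)
bird_quasi <- glm(bird_count ~ restoration_age + native_plant_cover + patch_size,
                  family = quasipoisson(link = "log"),
                  data = bird_data)
summary(bird_quasi)   # note the estimated dispersion parameter in the output
# Compare standard errors side by side
cbind(Poisson = summary(bird_poisson)$coefficients[, "Std. Error"],
      QuasiPoisson = summary(bird_quasi)$coefficients[, "Std. Error"])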
4.8.3 Residual Plots
# Calculate residuals
bird_data$pearson_resid <- residuals(bird_poisson, type = "pearson")
bird_data$deviance_resid <- residuals(bird_poisson, type = "deviance")
bird_data$fitted <- fitted(bird_poisson)
# Create diagnostic plots
p1 <- ggplot(bird_data, aes(x = fitted, y = deviance_resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_smooth(se = FALSE, color = "blue") +
labs(
title = "Deviance Residuals vs. Fitted",
x = "Fitted Values",
y = "Deviance Residuals"
) +
theme_minimal()
p2 <- ggplot(bird_data, aes(sample = deviance_resid)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(
title = "Q-Q Plot",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()
p3 <- ggplot(bird_data, aes(x = seq_along(deviance_resid), y = deviance_resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_hline(yintercept = c(-3, 3), linetype = "dashed", color = "orange") +
labs(
title = "Index Plot",
x = "Observation Number",
y = "Deviance Residuals"
) +
theme_minimal()
p4 <- ggplot(bird_data, aes(x = fitted, y = sqrt(abs(deviance_resid)))) +
geom_point(alpha = 0.6) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Scale-Location Plot",
x = "Fitted Values",
y = "√|Deviance Residuals|"
) +
theme_minimal()
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
Residuals vs. Fitted:
- Look for random scatter around zero
- The blue smoothed line should be roughly flat
- Patterns suggest missing predictors or the wrong model form
Q-Q Plot:
- Points should follow the diagonal line
- Count data show some discreteness, so perfect normality is not expected
- Major deviations in the tails are concerning
Index Plot:
- Identifies outliers (beyond ±3)
- Look for patterns suggesting temporal or spatial structure
Scale-Location:
- Checks for constant variance
- For Poisson models, some increase with fitted values is expected
- Strong patterns suggest overdispersion
Our model: Diagnostics look reasonable, supporting the Poisson model choice.
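As an optional complement to these plots, simulation-based residuals from the DHARMa package (assuming it is installed) are often easier to read for GLMs because they are uniform under a correctly specified model:
# Simulation-based residual checks (optional; requires the DHARMa package)
library(DHARMa)
sim_res <- simulateResiduals(fittedModel = bird_poisson, n = 1000)
plot(sim_res)             # QQ plot and residual-vs-predicted panel
testDispersion(sim_res)   # simulation-based test for over/underdispersion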
4.9 Offsets: Accounting for Different Exposure Times
Sometimes counts come from different sampling efforts, time periods, or area sizes. We use offsets to account for this.
Example: If one site was surveyed for 2 hours and another for 1 hour, we expect twice as many birds at the first site just due to effort!
# Hypothetical example with survey effort
# If you had a variable 'survey_hours' in your data:
bird_poisson_offset <- glm(bird_count ~ restoration_age + offset(log(survey_hours)),
family = poisson,
data = bird_data) # illustrative only: survey_hours is not in bird_data, so this chunk is not run
Offset = a predictor with coefficient fixed at 1
Without offset: \[\log(\lambda) = \beta_0 + \beta_1 X_1\]
With offset: \[\log(\lambda) = \log(\text{effort}) + \beta_0 + \beta_1 X_1\]
Which rearranges to: \[\log\left(\frac{\lambda}{\text{effort}}\right) = \beta_0 + \beta_1 X_1\]
So we’re modeling the rate (counts per unit effort), not raw counts!
When to use offsets:
- Survey effort varies (time, area, number of traps)
- Population sizes differ (counts per capita)
- Exposure times vary (person-years in epidemiology)
The offset variable should be log-transformed because we’re working on the log scale.
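To see how an offset recovers a rate, here is a small self-contained simulation: counts are generated in proportion to a varying survey effort, and the offset model recovers the per-hour coefficients. Everything below is simulated for illustration (survey_hours here is not part of the bird dataset):
# Simulated example: counts scale with survey effort, modelled as a rate via an offset
set.seed(1)
effort_demo <- data.frame(
  survey_hours = runif(200, 0.5, 3),   # varying effort
  habitat = rnorm(200)                 # a covariate
)
effort_demo$count <- rpois(200, lambda = effort_demo$survey_hours * exp(1 + 0.5 * effort_demo$habitat))
offset_fit <- glm(count ~ habitat + offset(log(survey_hours)),
                  family = poisson, data = effort_demo)
coef(offset_fit)   # intercept near 1 and slope near 0.5: the per-hour rate, not the raw count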
4.10 Model Comparison and Selection
Let’s compare several candidate models:
# Fit candidate models
m_null <- glm(bird_count ~ 1, family = poisson, data = bird_data)
m_age <- glm(bird_count ~ restoration_age, family = poisson, data = bird_data)
m_age_cover <- glm(bird_count ~ restoration_age + native_plant_cover,
family = poisson, data = bird_data)
m_full <- glm(bird_count ~ restoration_age + native_plant_cover + patch_size,
family = poisson, data = bird_data)
m_interaction <- glm(bird_count ~ restoration_age * native_plant_cover + patch_size,
family = poisson, data = bird_data)
# Create comparison table
models <- list(
"Null" = m_null,
"Age Only" = m_age,
"Age + Cover" = m_age_cover,
"Full (Age + Cover + Size)" = m_full,
"With Interaction" = m_interaction
)
aic_table <- data.frame(
Model = names(models),
AIC = sapply(models, AIC),
Deviance = sapply(models, deviance),
df = sapply(models, function(x) x$df.residual)
)
aic_table$Delta_AIC <- aic_table$AIC - min(aic_table$AIC)
aic_table$AIC_weight <- exp(-0.5 * aic_table$Delta_AIC) / sum(exp(-0.5 * aic_table$Delta_AIC))
# Sort by AIC
aic_table <- aic_table[order(aic_table$AIC), ]
print(aic_table)
 Model AIC Deviance df
Full (Age + Cover + Size) Full (Age + Cover + Size) 876.6576 181.3930 146
With Interaction With Interaction 878.5934 181.3287 145
Age + Cover Age + Cover 1377.7961 684.5314 147
Age Only Age Only 1998.2117 1306.9470 148
Null Null 2390.8844 1701.6197 149
Delta_AIC AIC_weight
Full (Age + Cover + Size) 0.000000 7.246953e-01
With Interaction 1.935745 2.753047e-01
Age + Cover 501.138466 1.094763e-109
Age Only 1121.554067 2.078664e-244
Null 1514.226752 0.000000e+00
Best model: Full (Age + Cover + Size) (lowest AIC)
AIC weight: 0.725 - roughly a 72.5% probability that this is the best model within the candidate set
Second best: With Interaction (ΔAIC = 1.94)
Interpretation:
- Models within ΔAIC < 2 of the best model have substantial support
- Models with ΔAIC > 10 have essentially no support
- The close competition suggests both top models are plausible
Ecological conclusion: An additive model is sufficient - no strong evidence for interactions
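As an alternative to building the AIC table by hand, compare_performance() from the performance package (loaded at the top of the chapter) tabulates the same comparison. A minimal sketch, assuming a reasonably recent version of the package:
# Tabulate the candidate models with the performance package
performance::compare_performance(m_null, m_age, m_age_cover, m_full, m_interaction,
                                 metrics = c("AIC", "BIC"))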
4.11 A Complete Example: Seed Predation Study
Let’s work through another example from start to finish.
4.11.1 The Dataset
Researchers studied seed predation by rodents in forest fragments. They placed seed trays at different distances from forest edges and counted how many seeds were removed.
Dataset: seed_predation.csv
Variables:
- seeds_removed: Number of seeds removed from the tray (out of 50)
- distance_m: Distance from forest edge in meters
- canopy_density: Percent canopy closure above the tray
- season: Season when the tray was set out (spring, summer, fall)
# Load data
seed_data <- read.csv("data/seed_predation.csv")
# Structure and summary
str(seed_data)
'data.frame': 120 obs. of 4 variables:
$ seeds_removed : int 10 11 4 2 4 3 7 6 5 6 ...
$ distance_m : num 9 21.1 73.3 85.2 78.8 33.2 8.2 28.6 23.8 38.5 ...
$ canopy_density: num 20.9 24 32.2 71.8 51.5 79.8 81.1 58.3 56.2 25.8 ...
$ season : chr "spring" "summer" "summer" "fall" ...
summary(seed_data)
 seeds_removed distance_m canopy_density season
Min. : 0.000 Min. : 0.20 Min. :20.90 Length:120
1st Qu.: 3.000 1st Qu.:32.20 1st Qu.:36.30 Class :character
Median : 4.000 Median :54.85 Median :57.55 Mode :character
Mean : 4.925 Mean :54.93 Mean :57.63
3rd Qu.: 7.000 3rd Qu.:79.35 3rd Qu.:79.47
Max. :14.000 Max. :99.30 Max. :94.30
# Convert season to factor
seed_data$season <- factor(seed_data$season,
levels = c("spring", "summer", "fall"))# Visualize
p1 <- ggplot(seed_data, aes(x = distance_m, y = seeds_removed)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", color = "darkred") +
labs(title = "Seeds Removed vs. Distance from Edge",
x = "Distance from Edge (m)",
y = "Seeds Removed") +
theme_minimal()
p2 <- ggplot(seed_data, aes(x = season, y = seeds_removed, fill = season)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(title = "Seed Predation by Season",
x = "Season",
y = "Seeds Removed") +
theme_minimal() +
theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2, ncol = 2)
# Fit Poisson regression
seed_model <- glm(seeds_removed ~ distance_m + canopy_density + season,
family = poisson,
data = seed_data)
summary(seed_model)
Call:
glm(formula = seeds_removed ~ distance_m + canopy_density + season,
family = poisson, data = seed_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.60927 0.13641 19.129 < 2e-16 ***
distance_m -0.01089 0.00148 -7.354 1.93e-13 ***
canopy_density -0.01108 0.00187 -5.929 3.05e-09 ***
seasonsummer 0.33661 0.09812 3.431 0.000602 ***
seasonfall 0.06044 0.10840 0.558 0.577167
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 222.66 on 119 degrees of freedom
Residual deviance: 124.90 on 115 degrees of freedom
AIC: 524.92
Number of Fisher Scoring iterations: 4
# Rate ratios
exp(coef(seed_model))
 (Intercept) distance_m canopy_density seasonsummer seasonfall
13.5891736 0.9891738 0.9889765 1.4002005 1.0623018
Distance from edge (RR = 0.989):
- Seed removal decreases with distance from the forest edge
- Each additional meter multiplies the expected removal by 0.989
- An edge effect: seed predators are more active near edges
Canopy density (RR = 0.989):
- Denser canopy is associated with lower seed removal
- The mechanism is less obvious; rodent foraging activity may differ between open and closed-canopy microsites
Season:
- Summer shows significantly higher removal than spring (RR ≈ 1.40)
- Fall is not significantly different from spring (RR ≈ 1.06, p = 0.58)
- Evidence of seasonal variation in rodent activity
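A quick back-of-the-envelope calculation translates the distance effect onto a more tangible scale: moving 100 m from the forest edge multiplies the expected removal by exp(100 × −0.0109), roughly a two-thirds reduction. Using the fitted model directly:
# Multiplicative effect of moving 100 m away from the forest edge
exp(100 * coef(seed_model)["distance_m"])   # ~0.34, i.e. about a 66% drop in expected removal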
4.12 Connecting to Future Topics
Poisson regression is the foundation for understanding more complex count models:
Chapter 5: Negative Binomial Regression
- Extends Poisson to handle overdispersion (when variance > mean)
- Adds a dispersion parameter for flexibility
- Same interpretation as Poisson, just more flexible
Chapter 6: Zero-Inflated Models
- For counts with excess zeros
- Combines logistic regression (from Chapter 3) with Poisson/negative binomial
- Two-part models: zero vs. non-zero, then the count
The progression: Poisson → Negative Binomial (overdispersion) → Zero-Inflated (excess zeros)
All use the log link and rate ratio interpretation you’ve learned here!
4.13 Exercise: Analyzing Plant Seedling Counts
Now it’s your turn! Analyze seedling recruitment data from a reforestation study.
# Load seedling recruitment data
seedling_recruitment <- read.csv("data/seedling_recruitment.csv")
# Variables:
# - seedling_count: Number of seedlings established in plot
# - light_availability: Percent canopy openness
# - soil_nutrients: Soil nutrient index
# - distance_to_seed_source: Distance to nearest mature tree (m)
# Your tasks:
# 1. Explore the data - create histograms and scatterplots
# [YOUR CODE HERE]
# 2. Check mean and variance - is Poisson appropriate?
# [YOUR CODE HERE]
# 3. Fit a Poisson regression model
# [YOUR CODE HERE]
# 4. Interpret coefficients as rate ratios
# [YOUR CODE HERE]
# 5. Check for overdispersion
# [YOUR CODE HERE]
# 6. Create prediction plots
# [YOUR CODE HERE]
# 7. Compare candidate models with AIC
# [YOUR CODE HERE]
Start with basic descriptives:
summary(seedling_recruitment$seedling_count)
var(seedling_recruitment$seedling_count) / mean(seedling_recruitment$seedling_count)
A variance/mean ratio near 1 suggests Poisson is appropriate.
For rate ratios:
exp(coef(your_model))
exp(confint(your_model))
Remember: RR > 1 = positive effect, RR < 1 = negative effect
Check overdispersion:
deviance(your_model) / df.residual(your_model)
A value above about 1.5-2 suggests overdispersion.
Solution.
library(tidyverse)
library(AER)
# 1. Load and explore
seedling_recruitment <- read.csv("data/seedling_recruitment.csv")
str(seedling_recruitment)
summary(seedling_recruitment)
# Distribution
hist(seedling_recruitment$seedling_count, breaks = 20,
main = "Distribution of Seedling Counts",
xlab = "Number of Seedlings")
# Scatterplots
par(mfrow = c(2, 2))
plot(seedling_count ~ light_availability, data = seedling_recruitment)
plot(seedling_count ~ soil_nutrients, data = seedling_recruitment)
plot(seedling_count ~ distance_to_seed_source, data = seedling_recruitment)
par(mfrow = c(1, 1))
# 2. Check mean vs variance
mean_count <- mean(seedling_recruitment$seedling_count)
var_count <- var(seedling_recruitment$seedling_count)
cat("Mean:", mean_count, "\n")
cat("Variance:", var_count, "\n")
cat("Variance/Mean ratio:", var_count / mean_count, "\n")
# 3. Fit Poisson model
seedling_model <- glm(seedling_count ~ light_availability + soil_nutrients + distance_to_seed_source,
family = poisson,
data = seedling_recruitment)
summary(seedling_model)
# 4. Rate ratios
cat("\n=== Rate Ratios ===\n")
rr <- exp(coef(seedling_model))
rr_ci <- exp(confint(seedling_model))
print(cbind(Rate_Ratio = rr, rr_ci))
# Interpretation
cat("\nInterpretation:\n")
cat("Light: Each 1% increase multiplies seedling count by", round(rr["light_availability"], 3), "\n")
cat("Nutrients: Each unit increase multiplies count by", round(rr["soil_nutrients"], 3), "\n")
cat("Distance: Each meter further multiplies count by", round(rr["distance_to_seed_source"], 3), "\n")
# 5. Check overdispersion
dispersion <- deviance(seedling_model) / df.residual(seedling_model)
cat("\nDispersion parameter:", round(dispersion, 3), "\n")
if (dispersion > 2) {
cat("WARNING: Substantial overdispersion detected!\n")
cat("Consider using negative binomial regression (Chapter 5)\n")
}
# Formal test
disptest <- dispersiontest(seedling_model)
print(disptest)
# 6. Prediction plots
pred_grid <- expand.grid(
light_availability = seq(min(seedling_recruitment$light_availability),
max(seedling_recruitment$light_availability),
length.out = 100),
soil_nutrients = mean(seedling_recruitment$soil_nutrients),
distance_to_seed_source = mean(seedling_recruitment$distance_to_seed_source)
)
pred_grid$predicted <- predict(seedling_model, newdata = pred_grid, type = "response")
ggplot() +
geom_point(data = seedling_recruitment,
aes(x = light_availability, y = seedling_count),
alpha = 0.5) +
geom_line(data = pred_grid,
aes(x = light_availability, y = predicted),
color = "blue", size = 1.2) +
labs(title = "Effect of Light on Seedling Recruitment",
x = "Light Availability (%)",
y = "Expected Seedling Count") +
theme_minimal()
# 7. Model comparison
m1 <- glm(seedling_count ~ light_availability, family = poisson, data = seedling_recruitment)
m2 <- glm(seedling_count ~ light_availability + soil_nutrients, family = poisson, data = seedling_recruitment)
m3 <- seedling_model
m4 <- glm(seedling_count ~ light_availability * soil_nutrients + distance_to_seed_source,
family = poisson, data = seedling_recruitment)
AIC(m1, m2, m3, m4)
4.14 Key Takeaways
- Poisson regression is for count data (non-negative integers)
- Uses Poisson distribution (mean = variance = λ)
- Uses log link to ensure positive predictions
- Coefficients are on log scale; exponentiate for rate ratios
- Rate ratios show multiplicative effects on expected counts
- Overdispersion (variance > mean) is common - check for it!
- If overdispersed, use quasi-Poisson or negative binomial (Chapter 5)
- Offsets account for different exposures/efforts
- Same model comparison tools as logistic regression (AIC, likelihood ratio tests)
4.15 Looking Ahead
In Chapter 5, we’ll learn about negative binomial regression, which extends Poisson to handle overdispersed count data by adding a dispersion parameter. This is one of the most commonly used models in ecology!
4.16 Additional Resources
Zuur, A. F., Ieno, E. N., Walker, N. J., Saveliev, A. A., & Smith, G. M. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer. [Chapter 4 on GLMs for count data]
Hilbe, J. M. (2014). Modeling Count Data. Cambridge University Press.
Ver Hoef, J. M., & Boveng, P. L. (2007). Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology, 88(11), 2766-2772.
Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data (2nd ed.). Cambridge University Press.
Bolker, B. M. (2008). Ecological Models and Data in R. Princeton University Press. [Chapter 9 on count data]