library(tidyverse)
library(MASS) # for additional GLM functions
library(performance) # for model comparison
library(car) # for diagnostics
Chapter 4: Poisson Regression for Count Data
4.1 From Probabilities to Counts: When Your Response is a Whole Number
In Chapter 3, we learned how to model binary data (presence/absence) using logistic regression with the binomial distribution. But what if we’re not just interested in whether something is present - what if we want to know how many are present?
Count data are everywhere in ecology:
- Number of individuals observed in a survey plot
- Number of offspring produced by a parent
- Number of species in a community
- Number of flowers on a plant
- Number of territorial displays observed
- Number of seeds germinating from a seed bank
- Number of parasites on a host
Unlike continuous measurements (which can take any value) or binary outcomes (0 or 1), counts are:
- Non-negative integers: 0, 1, 2, 3, … (never negative, never fractional)
- Discrete: You can’t have 2.5 individuals
- Often right-skewed: Many zeros and small values, few large values
In Chapter 2, we learned that the GLM framework allows us to choose appropriate distributions for different data types. For count data, we use the Poisson distribution with a log link function.
Let’s see why linear regression fails for count data, just as it failed for binary data.
4.2 The Problem with Linear Regression for Count Data
4.2.1 The Dataset
We’ll analyze data from a bird point count survey in restored prairie sites. Researchers conducted standardized point counts to understand how grassland bird abundance responds to restoration age and habitat characteristics.
Dataset: prairie_birds.csv
Variables:
- bird_count: Number of grassland birds detected at the point (count)
- restoration_age: Years since restoration (0 = unrestored agriculture, up to 20 years)
- native_plant_cover: Percent cover of native plant species (0-100%)
- patch_size: Size of the prairie patch in hectares
Study context: Prairie restoration is a key conservation strategy in the Midwest. Understanding how bird abundance responds to restoration helps managers prioritize sites and predict recovery trajectories.
4.2.2 Loading and Exploring the Data
# Load the data
bird_data <- read.csv("data/prairie_birds.csv")
# Examine structure
str(bird_data)
'data.frame': 150 obs. of 4 variables:
$ bird_count : int 30 10 23 33 17 6 8 8 17 12 ...
$ restoration_age : num 5.8 15.8 8.2 17.7 18.8 0.9 10.6 17.8 11 9.1 ...
$ native_plant_cover: num 84.7 49.8 38.8 24.6 11.1 39 57.2 21.7 44.5 21.8 ...
$ patch_size : num 39.4 1.5 39.2 36.7 31.9 24.6 8.7 1.4 23.2 25.1 ...
# Summary statistics
summary(bird_data)
  bird_count    restoration_age native_plant_cover   patch_size
Min. : 0.00 Min. : 0.00 Min. : 0.60 Min. : 1.40
1st Qu.: 9.25 1st Qu.: 4.90 1st Qu.:26.50 1st Qu.:13.10
Median :17.00 Median : 9.55 Median :48.15 Median :24.10
Mean :21.41 Mean :10.09 Mean :49.50 Mean :25.09
3rd Qu.:27.00 3rd Qu.:15.10 3rd Qu.:72.08 3rd Qu.:36.85
Max. :78.00 Max. :19.90 Max. :98.60 Max. :50.00
# Check for missing values
cat("\nMissing values:\n")
Missing values:
colSums(is.na(bird_data))
 bird_count restoration_age native_plant_cover patch_size
0 0 0 0
# Sample size
cat("\nTotal observations:", nrow(bird_data), "\n")
Total observations: 150
# Distribution of counts
cat("Bird count distribution:\n")Bird count distribution:
table(bird_data$bird_count)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 1 1 2 6 4 7 6 6 3 4 1 9 5 6 4 6 4 3 4 4 7 6 2 3 3
27 30 31 33 34 35 36 37 38 40 41 42 43 44 46 49 50 55 59 60 62 66 70 78
4 3 1 4 1 1 1 2 2 1 1 1 3 1 1 2 2 2 1 1 3 1 1 1
cat("\nSummary statistics for bird counts:\n")
Summary statistics for bird counts:
cat("Mean:", round(mean(bird_data$bird_count), 2), "\n")Mean: 21.41
cat("Variance:", round(var(bird_data$bird_count), 2), "\n")Variance: 264.69
cat("Variance/Mean ratio:", round(var(bird_data$bird_count) / mean(bird_data$bird_count), 2), "\n")Variance/Mean ratio: 12.36
cat("Range:", range(bird_data$bird_count), "\n")Range: 0 78
From our exploration:
Sample characteristics:
- 150 point count surveys conducted
- Bird counts range from 0 to 78 individuals
- Mean count = 21.41, variance = 264.69
- Variance/mean ratio = 12.36 - well above 1, so the raw counts are overdispersed; much of this marginal overdispersion may be explained by the predictors, and we will check the model-based dispersion after fitting (Section 4.8.2)
- No missing data
Distribution:
- Most sites have moderate counts (roughly 0-25 birds)
- A smaller number of sites have very high counts (40 or more birds)
- Right-skewed distribution typical of count data
# Plot 1: Histogram of counts
p1 <- ggplot(bird_data, aes(x = bird_count)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
labs(title = "Distribution of Bird Counts",
x = "Number of Birds",
y = "Frequency") +
theme_minimal()
# Plot 2: Counts vs restoration age
p2 <- ggplot(bird_data, aes(x = restoration_age, y = bird_count)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, color = "darkgreen") +
labs(title = "Bird Counts vs. Restoration Age",
x = "Restoration Age (years)",
y = "Number of Birds") +
theme_minimal()
# Plot 3: Counts vs native plant cover
p3 <- ggplot(bird_data, aes(x = native_plant_cover, y = bird_count)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, color = "darkblue") +
labs(title = "Bird Counts vs. Native Plant Cover",
x = "Native Plant Cover (%)",
y = "Number of Birds") +
theme_minimal()
# Plot 4: Counts vs patch size
p4 <- ggplot(bird_data, aes(x = patch_size, y = bird_count)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, color = "purple") +
labs(title = "Bird Counts vs. Patch Size",
x = "Patch Size (ha)",
y = "Number of Birds") +
theme_minimal()
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
Distribution: Count data are right-skewed with many low values and few high values
Restoration age: Bird abundance appears to increase with restoration age, with steeper increases in early years
Native plant cover: Positive relationship - more native plants associated with more birds
Patch size: Larger patches tend to support more birds, though relationship may be non-linear
Zeros: A few sites have zero birds detected - this is ecologically meaningful (habitat not suitable yet)
These patterns suggest restoration and habitat quality drive bird abundance!
4.2.3 Why Linear Regression Fails for Counts
Let’s see what happens when we try to use linear regression:
# Fit linear regression (wrong!)
linear_model <- lm(bird_count ~ restoration_age, data = bird_data)
# Create predictions
pred_grid <- data.frame(restoration_age = seq(0, 20, length.out = 100))
pred_grid$linear_pred <- predict(linear_model, newdata = pred_grid)
# Plot
ggplot(bird_data, aes(x = restoration_age, y = bird_count)) +
geom_point(alpha = 0.5) +
geom_line(data = pred_grid, aes(y = linear_pred), color = "red", linewidth = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
labs(
title = "Linear Regression Applied to Count Data (WRONG!)",
subtitle = "Red line shows predictions can go negative (impossible for counts!)",
x = "Restoration Age (years)",
y = "Bird Count"
) +
theme_minimal()
par(mfrow = c(2, 2))
plot(linear_model)
par(mfrow = c(1, 1))
Predictions can be negative: The red line goes below zero at low restoration ages. You can’t have negative birds!
Wrong variance structure: Count data variance often increases with the mean. Linear regression assumes constant variance.
Wrong distribution: Residuals aren’t normally distributed - they’re skewed because counts are bounded at zero.
Heteroscedasticity: The diagnostic plots show clear fanning pattern - variance increases with fitted values.
These violations mean our p-values, confidence intervals, and predictions are all wrong!
4.3 The Poisson Distribution: Made for Count Data
The Poisson distribution is specifically designed for count data. It has one parameter, λ (lambda), which is both the mean AND the variance.
# Create Poisson distributions with different lambdas
lambda_values <- c(1, 3, 7, 12)
poisson_data <- expand.grid(
count = 0:25,
lambda = lambda_values
) %>%
mutate(
probability = dpois(count, lambda),
lambda_label = paste0("λ = ", lambda)
)
ggplot(poisson_data, aes(x = count, y = probability)) +
geom_bar(stat = "identity", fill = "steelblue", color = "black") +
facet_wrap(~lambda_label, scales = "free_y") +
labs(
title = "Poisson Distribution for Different Mean Values (λ)",
subtitle = "Notice how the distribution shifts right and spreads as λ increases",
x = "Count",
y = "Probability"
) +
theme_minimal()
For count data: Only non-negative integers (0, 1, 2, 3, …)
One parameter: λ (lambda) determines both mean and variance
- E(Y) = λ
- Var(Y) = λ
- This is called equidispersion (mean = variance)
Shape changes with λ:
- Small λ: Right-skewed, many zeros
- Large λ: More symmetric, approaches normal
Ecological interpretation: λ is the expected count (average number of events)
Important: The assumption that mean = variance is strict and often violated in ecological data. We’ll address this in Chapter 5 with negative binomial regression.
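To make equidispersion concrete, we can simulate a large Poisson sample and check that the sample mean and variance nearly coincide. This is a minimal sketch using simulated values only (none of these objects belong to the bird analysis):
# Simulate 10,000 Poisson counts with lambda = 7
set.seed(42)
sim_counts <- rpois(10000, lambda = 7)
mean(sim_counts)                     # close to 7
var(sim_counts)                      # also close to 7 (equidispersion)
var(sim_counts) / mean(sim_counts)   # ratio close to 1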
4.4 Poisson Regression: The GLM Approach
Poisson regression solves the problems with linear regression by using:
1. The Poisson distribution (appropriate for counts)
2. The log link function (ensures predictions are positive)
3. A model for the expected count as a function of predictors
The Poisson regression model is:
\[ \begin{align} Y_i &\sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots \end{align} \]
Where:
- \(Y_i\) is the observed count for observation \(i\)
- \(\lambda_i\) is the expected count (Poisson parameter)
- The log link ensures \(\lambda_i\) is always positive
4.4.1 The Log Link Function
# Visualize the log link
linear_pred <- seq(-3, 3, length.out = 100)
expected_count <- exp(linear_pred) # Inverse of log link
link_df <- data.frame(
linear_predictor = linear_pred,
expected_count = expected_count
)
ggplot(link_df, aes(x = linear_predictor, y = expected_count)) +
geom_line(color = "darkgreen", size = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Log Link Function: Linear Predictor to Expected Count",
subtitle = "Exponential transformation ensures counts are always positive",
x = "Linear Predictor (β₀ + β₁X₁ + ...)",
y = "Expected Count (λ)"
) +
theme_minimal()
On the link scale (log):
- \(\log(\lambda) = \beta_0 + \beta_1 X_1 + \ldots\)
- Coefficients (β) represent changes in the log of the expected count
- The model is additive on the log scale
On the response scale (count):
- \(\lambda = e^{\beta_0 + \beta_1 X_1 + \ldots}\)
- Exponentiating gives us the expected count
- The model is multiplicative on the count scale
Key insight: A one-unit increase in X₁ multiplies the expected count by \(e^{\beta_1}\)
This is why we interpret Poisson coefficients as rate ratios after exponentiating!
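A quick numeric check makes the multiplicative interpretation concrete. The numbers below are purely illustrative (β₀ = 1, β₁ = 0.08), not output from a fitted model:
# Rate-ratio interpretation: a one-unit increase in X multiplies the expected count by exp(beta1)
beta0 <- 1.0    # illustrative intercept (log scale)
beta1 <- 0.08   # illustrative slope (log scale)
lambda_at_5 <- exp(beta0 + beta1 * 5)   # expected count at X = 5
lambda_at_6 <- exp(beta0 + beta1 * 6)   # expected count at X = 6
lambda_at_6 / lambda_at_5               # equals exp(beta1), about 1.083
exp(beta1)                              # the same rate ratio, straight from the coefficient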
4.5 Fitting Poisson Regression in R
Let’s fit a Poisson regression to our bird count data:
# Fit Poisson regression model
bird_poisson <- glm(bird_count ~ restoration_age + native_plant_cover + patch_size,
family = poisson(link = "log"),
data = bird_data)
# Summary
summary(bird_poisson)
Call:
glm(formula = bird_count ~ restoration_age + native_plant_cover +
patch_size, family = poisson(link = "log"), data = bird_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5646230 0.0759830 7.431 1.08e-13 ***
restoration_age 0.0760829 0.0032618 23.325 < 2e-16 ***
native_plant_cover 0.0149407 0.0006847 21.822 < 2e-16 ***
patch_size 0.0293899 0.0013371 21.981 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1701.62 on 149 degrees of freedom
Residual deviance: 181.39 on 146 degrees of freedom
AIC: 876.66
Number of Fisher Scoring iterations: 4
Let’s break down the key components:
Deviance residuals (not shown in the summary output above, but available via residuals(bird_poisson, type = "deviance")):
- Should be roughly symmetric around zero
- Large values indicate poorly fitted observations
Coefficients (on the log scale):
- restoration_age: β = 0.0761
- native_plant_cover: β = 0.0149
- patch_size: β = 0.0294
All coefficients are positive and significant, suggesting all three factors increase bird abundance.
Deviance:
- Null deviance: 1701.6 (model with intercept only)
- Residual deviance: 181.4 (our fitted model)
- Reduction: 1520.2 - a substantial improvement
AIC: 876.7 - useful for model comparison
But what do these coefficients actually mean? Let’s explore interpretation next.
4.6 Interpreting Poisson Regression Coefficients
The coefficients are on the log scale, which makes direct interpretation challenging. We need to exponentiate them to get rate ratios (also called incidence rate ratios or IRRs).
# Extract coefficients
coefs <- coef(bird_poisson)
# Calculate rate ratios (exponentiate coefficients)
rate_ratios <- exp(coefs)
# Create interpretation table
rr_table <- data.frame(
Variable = names(coefs),
Coefficient = round(coefs, 4),
Rate_Ratio = round(rate_ratios, 4),
Percent_Change = round((rate_ratios - 1) * 100, 2)
)
print(rr_table)
 Variable Coefficient Rate_Ratio Percent_Change
(Intercept) (Intercept) 0.5646 1.7588 75.88
restoration_age restoration_age 0.0761 1.0791 7.91
native_plant_cover native_plant_cover 0.0149 1.0151 1.51
patch_size patch_size 0.0294 1.0298 2.98
restoration_age (RR = 1.079):
- For each additional year of restoration, the expected bird count is multiplied by 1.079
- This is a 7.9% increase per year
- Over 10 years the count is multiplied by exp(10 × 0.0761) ≈ 2.14
native_plant_cover (RR = 1.015):
- Each 1% increase in native cover multiplies the expected bird count by 1.015
- This is a 1.51% increase per 1% cover
- A 10-point increase in cover multiplies the count by exp(10 × 0.0149) ≈ 1.16
patch_size (RR = 1.030):
- Each additional hectare multiplies the expected bird count by 1.030
- This is a 2.98% increase per hectare
Key insight: Rate ratios > 1 indicate positive effects (count increases), rate ratios < 1 indicate negative effects (count decreases), and a rate ratio = 1 means no effect.
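Because effects are multiplicative, a k-unit change in a predictor corresponds to exp(k × β), i.e. the rate ratio raised to the k-th power. A quick check of the 10-year and 10-point-cover figures quoted above, using the fitted coefficients:
# Cumulative multiplicative effects over larger changes in a predictor
exp(10 * coef(bird_poisson)["restoration_age"])      # ~2.14: ten years of restoration
exp(10 * coef(bird_poisson)["native_plant_cover"])   # ~1.16: ten percentage points more native cover
exp(coef(bird_poisson)["restoration_age"])^10        # identical to the first line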
4.6.1 Confidence Intervals for Rate Ratios
# Get confidence intervals
conf_ints <- confint(bird_poisson)
Waiting for profiling to be done...
# Exponentiate for rate ratio scale
rr_ci_table <- exp(cbind(
Rate_Ratio = coef(bird_poisson),
conf_ints
))
print(round(rr_ci_table, 3))
 Rate_Ratio 2.5 % 97.5 %
(Intercept) 1.759 1.514 2.039
restoration_age 1.079 1.072 1.086
native_plant_cover 1.015 1.014 1.016
patch_size 1.030 1.027 1.033
The 95% confidence intervals tell us the range of plausible values for the rate ratios.
For restoration_age:
- 95% CI: [1.072, 1.086]
- We’re 95% confident the true rate ratio lies in this range
- Since the CI doesn’t include 1, the effect is statistically significant
- All values > 1, confirming a positive effect
Similar interpretation for the other predictors - all CIs exclude 1, confirming significant positive effects.
4.7 Visualizing Poisson Regression Predictions
While rate ratios tell us about multiplicative effects, visualizing predictions on the count scale is often more intuitive:
# Create prediction grids for each variable
grid_age <- data.frame(
restoration_age = seq(0, 20, length.out = 100),
native_plant_cover = mean(bird_data$native_plant_cover),
patch_size = mean(bird_data$patch_size)
)
grid_cover <- data.frame(
restoration_age = mean(bird_data$restoration_age),
native_plant_cover = seq(0, 100, length.out = 100),
patch_size = mean(bird_data$patch_size)
)
grid_size <- data.frame(
restoration_age = mean(bird_data$restoration_age),
native_plant_cover = mean(bird_data$native_plant_cover),
patch_size = seq(min(bird_data$patch_size), max(bird_data$patch_size), length.out = 100)
)
# Get predictions with confidence intervals
get_poisson_predictions <- function(model, newdata) {
pred <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
newdata$count <- exp(pred$fit)
newdata$lower <- exp(pred$fit - 1.96 * pred$se.fit)
newdata$upper <- exp(pred$fit + 1.96 * pred$se.fit)
return(newdata)
}
grid_age <- get_poisson_predictions(bird_poisson, grid_age)
grid_cover <- get_poisson_predictions(bird_poisson, grid_cover)
grid_size <- get_poisson_predictions(bird_poisson, grid_size)
# Create plots
p1 <- ggplot(grid_age, aes(x = restoration_age, y = count)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "steelblue") +
geom_line(color = "steelblue", size = 1.2) +
labs(
title = "Effect of Restoration Age",
x = "Restoration Age (years)",
y = "Expected Bird Count"
) +
theme_minimal()
p2 <- ggplot(grid_cover, aes(x = native_plant_cover, y = count)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "darkgreen") +
geom_line(color = "darkgreen", size = 1.2) +
labs(
title = "Effect of Native Plant Cover",
x = "Native Plant Cover (%)",
y = "Expected Bird Count"
) +
theme_minimal()
p3 <- ggplot(grid_size, aes(x = patch_size, y = count)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "purple") +
geom_line(color = "purple", size = 1.2) +
labs(
title = "Effect of Patch Size",
x = "Patch Size (ha)",
y = "Expected Bird Count"
) +
theme_minimal()
gridExtra::grid.arrange(p1, p2, p3, ncol = 2)
Restoration age effect:
- Expected bird count increases exponentially with restoration age
- New restorations (0-5 years) have comparatively low predicted abundance
- Mature restorations (15-20 years) are predicted to support several times as many birds
- Management implication: Be patient - restoration takes time!
Native plant cover effect:
- Strong positive relationship throughout the range
- Sites with low native cover (< 20%) are predicted to support far fewer birds than sites with > 80% cover
- Management implication: Prioritize increasing native plant cover
Patch size effect:
- Larger patches support more birds, as expected from species-area relationships
- Because of the log link the effect is multiplicative: each additional hectare multiplies the expected count by about 1.03
- Small patches support relatively few birds
- Management implication: Protect and restore large contiguous patches when possible
All three factors work together - the best sites for grassland birds are large, old restorations with high native plant cover!
4.8 Model Diagnostics for Poisson Regression
4.8.1 Deviance and Model Fit
# Extract deviance information
null_dev <- bird_poisson$null.deviance
resid_dev <- bird_poisson$deviance
df_null <- bird_poisson$df.null
df_resid <- bird_poisson$df.residual
# Calculate pseudo R-squared
pseudo_r2 <- 1 - (resid_dev / null_dev)
cat("Null Deviance:", round(null_dev, 2), "on", df_null, "df\n")Null Deviance: 1701.62 on 149 df
cat("Residual Deviance:", round(resid_dev, 2), "on", df_resid, "df\n")Residual Deviance: 181.39 on 146 df
cat("Reduction:", round(null_dev - resid_dev, 2), "\n")Reduction: 1520.23
cat("Pseudo R²:", round(pseudo_r2, 4), "\n\n")Pseudo R²: 0.8934
# Likelihood ratio test
lr_stat <- null_dev - resid_dev
lr_df <- df_null - df_resid
lr_p <- pchisq(lr_stat, lr_df, lower.tail = FALSE)
cat("Likelihood Ratio Test:\n")Likelihood Ratio Test:
cat("Chi-squared =", round(lr_stat, 2), "on", lr_df, "df\n")Chi-squared = 1520.23 on 3 df
cat("p-value < 0.001\n")p-value < 0.001
Deviance measures how well the model fits the data:
- Lower deviance = better fit
- The reduction from null to residual deviance shows the improvement from adding predictors
- Our model reduces deviance by 89.3%
Pseudo R²: 0.893 indicates substantial explanatory power
Likelihood ratio test: Highly significant (p < 0.001), confirming the predictors meaningfully improve the model
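The performance package loaded at the start of the chapter provides convenience functions for similar fit summaries. A minimal sketch, assuming a reasonably recent version of performance is installed (its pseudo R² for a Poisson GLM is computed differently from the deviance-based value above, so the numbers need not match exactly):
# Model fit summaries via the performance package
performance::r2(bird_poisson)                  # pseudo R-squared
performance::model_performance(bird_poisson)   # AIC, BIC, RMSE, and related indices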
4.8.2 Checking for Overdispersion
A critical assumption of Poisson regression is that mean = variance. Let’s check:
# Calculate dispersion parameter
dispersion <- resid_dev / df_resid
cat("Dispersion parameter:", round(dispersion, 3), "\n")Dispersion parameter: 1.242
cat("\nInterpretation:\n")
Interpretation:
if (dispersion < 1) {
cat(" Dispersion < 1: Possible underdispersion\n")
} else if (dispersion <= 1.5) {
cat(" Dispersion ≈ 1: Poisson assumption reasonable\n")
} else if (dispersion <= 3) {
cat(" Dispersion > 1: Mild overdispersion\n")
} else {
cat(" Dispersion >> 1: Substantial overdispersion - consider negative binomial!\n")
}
 Dispersion ≈ 1: Poisson assumption reasonable
# Formal test
library(AER) # for the dispersiontest() function
disptest_result <- dispersiontest(bird_poisson)
print(disptest_result)
Overdispersion test
data: bird_poisson
z = 1.193, p-value = 0.1164
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.148533
Overdispersion occurs when variance > mean (violating the Poisson assumption that mean = variance).
Signs of overdispersion:
- Dispersion parameter > 1 (residual deviance / df > 1)
- Significant dispersion test
- Residual deviance much larger than the degrees of freedom
Consequences if ignored:
- Standard errors are too small (underestimated uncertainty)
- P-values are too small (false positives)
- Confidence intervals are too narrow
- You might claim significance when there isn’t any!
Common causes in ecology:
- Individual heterogeneity
- Spatial clustering (contagion)
- Missing important predictors
- Temporal correlation
What to do:
- If dispersion ≈ 1-1.5: Poisson is okay
- If dispersion ≈ 1.5-2: Consider quasi-Poisson, which adjusts the standard errors (see the sketch below)
- If dispersion > 2-3: Use negative binomial regression (Chapter 5)
Our model: Dispersion = 1.24. Poisson seems appropriate!
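Had the dispersion check pointed to overdispersion, a quick remedy is a quasi-Poisson fit, which keeps the same coefficient estimates but scales the standard errors by the estimated dispersion. A minimal sketch for our model (shown for illustration only, since a dispersion of 1.24 does not require it):
# Quasi-Poisson refit: identical point estimates, standard errors inflated by sqrt(dispersion)
bird_quasi <- glm(bird_count ~ restoration_age + native_plant_cover + patch_size,
                  family = quasipoisson(link = "log"),
                  data = bird_data)
summary(bird_quasi)   # note the estimated dispersion parameter in the output
# Compare standard errors side by side
cbind(Poisson = summary(bird_poisson)$coefficients[, "Std. Error"],
      QuasiPoisson = summary(bird_quasi)$coefficients[, "Std. Error"])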
4.8.3 Residual Plots
# Calculate residuals
bird_data$pearson_resid <- residuals(bird_poisson, type = "pearson")
bird_data$deviance_resid <- residuals(bird_poisson, type = "deviance")
bird_data$fitted <- fitted(bird_poisson)
# Create diagnostic plots
p1 <- ggplot(bird_data, aes(x = fitted, y = deviance_resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_smooth(se = FALSE, color = "blue") +
labs(
title = "Deviance Residuals vs. Fitted",
x = "Fitted Values",
y = "Deviance Residuals"
) +
theme_minimal()
p2 <- ggplot(bird_data, aes(sample = deviance_resid)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(
title = "Q-Q Plot",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()
p3 <- ggplot(bird_data, aes(x = seq_along(deviance_resid), y = deviance_resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_hline(yintercept = c(-3, 3), linetype = "dashed", color = "orange") +
labs(
title = "Index Plot",
x = "Observation Number",
y = "Deviance Residuals"
) +
theme_minimal()
p4 <- ggplot(bird_data, aes(x = fitted, y = sqrt(abs(deviance_resid)))) +
geom_point(alpha = 0.6) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Scale-Location Plot",
x = "Fitted Values",
y = "√|Deviance Residuals|"
) +
theme_minimal()
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
Residuals vs. Fitted:
- Look for random scatter around zero
- The blue smoothed line should be roughly flat
- Patterns suggest missing predictors or the wrong model form
Q-Q Plot:
- Points should follow the diagonal line
- Count data show some discreteness, so perfect normality is not expected
- Major deviations in the tails are concerning
Index Plot:
- Identifies outliers (beyond ±3)
- Look for patterns suggesting temporal or spatial structure
Scale-Location:
- Checks for constant variance
- For Poisson models, some increase with fitted values is expected
- Strong patterns suggest overdispersion
Our model: Diagnostics look reasonable, supporting the Poisson model choice.
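As an optional complement to these plots, simulation-based residuals from the DHARMa package (assuming it is installed) are often easier to read for GLMs because they are uniform under a correctly specified model:
# Simulation-based residual checks (optional; requires the DHARMa package)
library(DHARMa)
sim_res <- simulateResiduals(fittedModel = bird_poisson, n = 1000)
plot(sim_res)             # QQ plot and residual-vs-predicted panel
testDispersion(sim_res)   # simulation-based test for over/underdispersion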
4.9 Offsets: Accounting for Different Exposure Times
Sometimes counts come from different sampling efforts, time periods, or area sizes. We use offsets to account for this.
Example: If one site was surveyed for 2 hours and another for 1 hour, we expect twice as many birds at the first site just due to effort!
# Hypothetical example with survey effort
# If you had a variable 'survey_hours' in your data:
bird_poisson_offset <- glm(bird_count ~ restoration_age + offset(log(survey_hours)),
family = poisson,
data = bird_data) # illustrative only: survey_hours is not in bird_data, so this chunk is not run
Offset = a predictor with coefficient fixed at 1
Without offset: \[\log(\lambda) = \beta_0 + \beta_1 X_1\]
With offset: \[\log(\lambda) = \log(\text{effort}) + \beta_0 + \beta_1 X_1\]
Which rearranges to: \[\log\left(\frac{\lambda}{\text{effort}}\right) = \beta_0 + \beta_1 X_1\]
So we’re modeling the rate (counts per unit effort), not raw counts!
When to use offsets:
- Survey effort varies (time, area, number of traps)
- Population sizes differ (counts per capita)
- Exposure times vary (person-years in epidemiology)
The offset variable should be log-transformed because we’re working on the log scale.
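To see how an offset recovers a rate, here is a small self-contained simulation: counts are generated in proportion to a varying survey effort, and the offset model recovers the per-hour coefficients. Everything below is simulated for illustration (survey_hours here is not part of the bird dataset):
# Simulated example: counts scale with survey effort, modelled as a rate via an offset
set.seed(1)
effort_demo <- data.frame(
  survey_hours = runif(200, 0.5, 3),   # varying effort
  habitat = rnorm(200)                 # a covariate
)
effort_demo$count <- rpois(200, lambda = effort_demo$survey_hours * exp(1 + 0.5 * effort_demo$habitat))
offset_fit <- glm(count ~ habitat + offset(log(survey_hours)),
                  family = poisson, data = effort_demo)
coef(offset_fit)   # intercept near 1 and slope near 0.5: the per-hour rate, not the raw count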
4.10 Model Comparison and Selection
Let’s compare several candidate models:
# Fit candidate models
m_null <- glm(bird_count ~ 1, family = poisson, data = bird_data)
m_age <- glm(bird_count ~ restoration_age, family = poisson, data = bird_data)
m_age_cover <- glm(bird_count ~ restoration_age + native_plant_cover,
family = poisson, data = bird_data)
m_full <- glm(bird_count ~ restoration_age + native_plant_cover + patch_size,
family = poisson, data = bird_data)
m_interaction <- glm(bird_count ~ restoration_age * native_plant_cover + patch_size,
family = poisson, data = bird_data)
# Create comparison table
models <- list(
"Null" = m_null,
"Age Only" = m_age,
"Age + Cover" = m_age_cover,
"Full (Age + Cover + Size)" = m_full,
"With Interaction" = m_interaction
)
aic_table <- data.frame(
Model = names(models),
AIC = sapply(models, AIC),
Deviance = sapply(models, deviance),
df = sapply(models, function(x) x$df.residual)
)
aic_table$Delta_AIC <- aic_table$AIC - min(aic_table$AIC)
aic_table$AIC_weight <- exp(-0.5 * aic_table$Delta_AIC) / sum(exp(-0.5 * aic_table$Delta_AIC))
# Sort by AIC
aic_table <- aic_table[order(aic_table$AIC), ]
print(aic_table)
 Model AIC Deviance df
Full (Age + Cover + Size) Full (Age + Cover + Size) 876.6576 181.3930 146
With Interaction With Interaction 878.5934 181.3287 145
Age + Cover Age + Cover 1377.7961 684.5314 147
Age Only Age Only 1998.2117 1306.9470 148
Null Null 2390.8844 1701.6197 149
Delta_AIC AIC_weight
Full (Age + Cover + Size) 0.000000 7.246953e-01
With Interaction 1.935745 2.753047e-01
Age + Cover 501.138466 1.094763e-109
Age Only 1121.554067 2.078664e-244
Null 1514.226752 0.000000e+00
Best model: Full (Age + Cover + Size) (lowest AIC)
AIC weight: 0.725 - roughly a 72.5% probability that this is the best model within the candidate set
Second best: With Interaction (ΔAIC = 1.94)
Interpretation:
- Models within ΔAIC < 2 of the best model have substantial support
- Models with ΔAIC > 10 have essentially no support
- The close competition suggests both top models are plausible
Ecological conclusion: An additive model is sufficient - no strong evidence for interactions
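As an alternative to building the AIC table by hand, compare_performance() from the performance package (loaded at the top of the chapter) tabulates the same comparison. A minimal sketch, assuming a reasonably recent version of the package:
# Tabulate the candidate models with the performance package
performance::compare_performance(m_null, m_age, m_age_cover, m_full, m_interaction,
                                 metrics = c("AIC", "BIC"))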
4.11 A Complete Example: Seed Predation Study
Let’s work through another example from start to finish.
4.11.1 The Dataset
Researchers studied seed predation by rodents in forest fragments. They placed seed trays at different distances from forest edges and counted how many seeds were removed.
Dataset: seed_predation.csv
Variables:
- seeds_removed: Number of seeds removed from the tray (out of 50)
- distance_m: Distance from forest edge in meters
- canopy_density: Percent canopy closure above the tray
- season: Season when the tray was set out (spring, summer, fall)
# Load data
seed_data <- read.csv("data/seed_predation.csv")
# Structure and summary
str(seed_data)
'data.frame': 120 obs. of 4 variables:
$ seeds_removed : int 10 11 4 2 4 3 7 6 5 6 ...
$ distance_m : num 9 21.1 73.3 85.2 78.8 33.2 8.2 28.6 23.8 38.5 ...
$ canopy_density: num 20.9 24 32.2 71.8 51.5 79.8 81.1 58.3 56.2 25.8 ...
$ season : chr "spring" "summer" "summer" "fall" ...
summary(seed_data)
 seeds_removed distance_m canopy_density season
Min. : 0.000 Min. : 0.20 Min. :20.90 Length:120
1st Qu.: 3.000 1st Qu.:32.20 1st Qu.:36.30 Class :character
Median : 4.000 Median :54.85 Median :57.55 Mode :character
Mean : 4.925 Mean :54.93 Mean :57.63
3rd Qu.: 7.000 3rd Qu.:79.35 3rd Qu.:79.47
Max. :14.000 Max. :99.30 Max. :94.30
# Convert season to factor
seed_data$season <- factor(seed_data$season,
levels = c("spring", "summer", "fall"))# Visualize
p1 <- ggplot(seed_data, aes(x = distance_m, y = seeds_removed)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", color = "darkred") +
labs(title = "Seeds Removed vs. Distance from Edge",
x = "Distance from Edge (m)",
y = "Seeds Removed") +
theme_minimal()
p2 <- ggplot(seed_data, aes(x = season, y = seeds_removed, fill = season)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(title = "Seed Predation by Season",
x = "Season",
y = "Seeds Removed") +
theme_minimal() +
theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2, ncol = 2)
# Fit Poisson regression
seed_model <- glm(seeds_removed ~ distance_m + canopy_density + season,
family = poisson,
data = seed_data)
summary(seed_model)
Call:
glm(formula = seeds_removed ~ distance_m + canopy_density + season,
family = poisson, data = seed_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.60927 0.13641 19.129 < 2e-16 ***
distance_m -0.01089 0.00148 -7.354 1.93e-13 ***
canopy_density -0.01108 0.00187 -5.929 3.05e-09 ***
seasonsummer 0.33661 0.09812 3.431 0.000602 ***
seasonfall 0.06044 0.10840 0.558 0.577167
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 222.66 on 119 degrees of freedom
Residual deviance: 124.90 on 115 degrees of freedom
AIC: 524.92
Number of Fisher Scoring iterations: 4
# Rate ratios
exp(coef(seed_model))
 (Intercept) distance_m canopy_density seasonsummer seasonfall
13.5891736 0.9891738 0.9889765 1.4002005 1.0623018
Distance from edge (RR = 0.989):
- Seed removal decreases with distance from the forest edge
- Each additional meter multiplies the expected removal by 0.989
- An edge effect: seed predators are more active near edges
Canopy density (RR = 0.989):
- Denser canopy is associated with lower seed removal
- The mechanism is less obvious; rodent foraging activity may differ between open and closed-canopy microsites
Season:
- Summer shows significantly higher removal than spring (RR ≈ 1.40)
- Fall is not significantly different from spring (RR ≈ 1.06, p = 0.58)
- Evidence of seasonal variation in rodent activity
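A quick back-of-the-envelope calculation translates the distance effect onto a more tangible scale: moving 100 m from the forest edge multiplies the expected removal by exp(100 × −0.0109), roughly a two-thirds reduction. Using the fitted model directly:
# Multiplicative effect of moving 100 m away from the forest edge
exp(100 * coef(seed_model)["distance_m"])   # ~0.34, i.e. about a 66% drop in expected removal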
4.12 Connecting to Future Topics
Poisson regression is the foundation for understanding more complex count models:
Chapter 5: Negative Binomial Regression
- Extends Poisson to handle overdispersion (when variance > mean)
- Adds a dispersion parameter for flexibility
- Same interpretation as Poisson, just more flexible
Chapter 6: Zero-Inflated Models
- For counts with excess zeros
- Combines logistic regression (from Chapter 3) with Poisson/negative binomial
- Two-part models: zero vs. non-zero, then the count
The progression: Poisson → Negative Binomial (overdispersion) → Zero-Inflated (excess zeros)
All use the log link and rate ratio interpretation you’ve learned here!
4.13 Exercise: Analyzing Plant Seedling Counts
Now it’s your turn! Analyze seedling recruitment data from a reforestation study.
# Load seedling recruitment data
seedling_recruitment <- read.csv("data/seedling_recruitment.csv")
# Variables:
# - seedling_count: Number of seedlings established in plot
# - light_availability: Percent canopy openness
# - soil_nutrients: Soil nutrient index
# - distance_to_seed_source: Distance to nearest mature tree (m)
# Your tasks:
# 1. Explore the data - create histograms and scatterplots
# [YOUR CODE HERE]
# 2. Check mean and variance - is Poisson appropriate?
# [YOUR CODE HERE]
# 3. Fit a Poisson regression model
# [YOUR CODE HERE]
# 4. Interpret coefficients as rate ratios
# [YOUR CODE HERE]
# 5. Check for overdispersion
# [YOUR CODE HERE]
# 6. Create prediction plots
# [YOUR CODE HERE]
# 7. Compare candidate models with AIC
# [YOUR CODE HERE]
Start with basic descriptives:
summary(seedling_recruitment$seedling_count)
var(seedling_recruitment$seedling_count) / mean(seedling_recruitment$seedling_count)
A variance/mean ratio near 1 suggests Poisson is appropriate.
For rate ratios:
exp(coef(your_model))
exp(confint(your_model))
Remember: RR > 1 = positive effect, RR < 1 = negative effect
Check overdispersion:
deviance(your_model) / df.residual(your_model)
A value above about 1.5-2 suggests overdispersion.
Solution.
library(tidyverse)
library(AER)
# 1. Load and explore
seedling_recruitment <- read.csv("data/seedling_recruitment.csv")
str(seedling_recruitment)
summary(seedling_recruitment)
# Distribution
hist(seedling_recruitment$seedling_count, breaks = 20,
main = "Distribution of Seedling Counts",
xlab = "Number of Seedlings")
# Scatterplots
par(mfrow = c(2, 2))
plot(seedling_count ~ light_availability, data = seedling_recruitment)
plot(seedling_count ~ soil_nutrients, data = seedling_recruitment)
plot(seedling_count ~ distance_to_seed_source, data = seedling_recruitment)
par(mfrow = c(1, 1))
# 2. Check mean vs variance
mean_count <- mean(seedling_recruitment$seedling_count)
var_count <- var(seedling_recruitment$seedling_count)
cat("Mean:", mean_count, "\n")
cat("Variance:", var_count, "\n")
cat("Variance/Mean ratio:", var_count / mean_count, "\n")
# 3. Fit Poisson model
seedling_model <- glm(seedling_count ~ light_availability + soil_nutrients + distance_to_seed_source,
family = poisson,
data = seedling_recruitment)
summary(seedling_model)
# 4. Rate ratios
cat("\n=== Rate Ratios ===\n")
rr <- exp(coef(seedling_model))
rr_ci <- exp(confint(seedling_model))
print(cbind(Rate_Ratio = rr, rr_ci))
# Interpretation
cat("\nInterpretation:\n")
cat("Light: Each 1% increase multiplies seedling count by", round(rr["light_availability"], 3), "\n")
cat("Nutrients: Each unit increase multiplies count by", round(rr["soil_nutrients"], 3), "\n")
cat("Distance: Each meter further multiplies count by", round(rr["distance_to_seed_source"], 3), "\n")
# 5. Check overdispersion
dispersion <- deviance(seedling_model) / df.residual(seedling_model)
cat("\nDispersion parameter:", round(dispersion, 3), "\n")
if (dispersion > 2) {
cat("WARNING: Substantial overdispersion detected!\n")
cat("Consider using negative binomial regression (Chapter 5)\n")
}
# Formal test
disptest <- dispersiontest(seedling_model)
print(disptest)
# 6. Prediction plots
pred_grid <- expand.grid(
light_availability = seq(min(seedling_recruitment$light_availability),
max(seedling_recruitment$light_availability),
length.out = 100),
soil_nutrients = mean(seedling_recruitment$soil_nutrients),
distance_to_seed_source = mean(seedling_recruitment$distance_to_seed_source)
)
pred_grid$predicted <- predict(seedling_model, newdata = pred_grid, type = "response")
ggplot() +
geom_point(data = seedling_recruitment,
aes(x = light_availability, y = seedling_count),
alpha = 0.5) +
geom_line(data = pred_grid,
aes(x = light_availability, y = predicted),
color = "blue", size = 1.2) +
labs(title = "Effect of Light on Seedling Recruitment",
x = "Light Availability (%)",
y = "Expected Seedling Count") +
theme_minimal()
# 7. Model comparison
m1 <- glm(seedling_count ~ light_availability, family = poisson, data = seedling_recruitment)
m2 <- glm(seedling_count ~ light_availability + soil_nutrients, family = poisson, data = seedling_recruitment)
m3 <- seedling_model
m4 <- glm(seedling_count ~ light_availability * soil_nutrients + distance_to_seed_source,
family = poisson, data = seedling_recruitment)
AIC(m1, m2, m3, m4)
4.14 Key Takeaways
- Poisson regression is for count data (non-negative integers)
- Uses Poisson distribution (mean = variance = λ)
- Uses log link to ensure positive predictions
- Coefficients are on log scale; exponentiate for rate ratios
- Rate ratios show multiplicative effects on expected counts
- Overdispersion (variance > mean) is common - check for it!
- If overdispersed, use quasi-Poisson or negative binomial (Chapter 5)
- Offsets account for different exposures/efforts
- Same model comparison tools as logistic regression (AIC, likelihood ratio tests)
4.15 Looking Ahead
In Chapter 5, we’ll learn about negative binomial regression, which extends Poisson to handle overdispersed count data by adding a dispersion parameter. This is one of the most commonly used models in ecology!
4.16 Additional Resources
Zuur, A. F., Ieno, E. N., Walker, N. J., Saveliev, A. A., & Smith, G. M. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer. [Chapter 4 on GLMs for count data]
Hilbe, J. M. (2014). Modeling Count Data. Cambridge University Press.
Ver Hoef, J. M., & Boveng, P. L. (2007). Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology, 88(11), 2766-2772.
Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data (2nd ed.). Cambridge University Press.
Bolker, B. M. (2008). Ecological Models and Data in R. Princeton University Press. [Chapter 9 on count data]