Chapter 5: Negative Binomial Models for Ecological Count Data
library(tidyverse)
library(MASS) # for negative binomial models
library(DHARMa) # for model diagnostics
library(ggeffects) # for plotting model predictions
library(performance) # for model comparison
5.1 When Poisson Isn’t Enough: Dealing with Overdispersion
In the previous chapter, we explored Poisson regression for modeling count data in ecological studies. We learned that the Poisson distribution assumes the mean equals the variance (\(E(Y) = \mathrm{Var}(Y) = \lambda\)). However, ecological count data rarely satisfy this strict assumption.
Overdispersion occurs when the variance in your data exceeds what would be expected under a Poisson distribution. This is extremely common in ecological studies for several reasons (a small simulation after this list illustrates the idea):
- Individual heterogeneity in organisms
- Spatial clustering of populations
- Temporal variability in environmental conditions
- Unmeasured variables affecting counts
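The simulation below (values chosen arbitrarily for illustration) shows how unmeasured heterogeneity alone pushes the variance above the mean: counts drawn from a single fixed rate have a variance-to-mean ratio near 1, while counts whose underlying rate varies among sites do not.
# Illustrative simulation (hypothetical values): counts from one shared rate
# versus counts whose underlying rate varies among sites
set.seed(1)
homogeneous <- rpois(1000, lambda = 5)              # one shared rate
site_rates <- rgamma(1000, shape = 2, scale = 2.5)  # site-specific rates, mean 5
heterogeneous <- rpois(1000, lambda = site_rates)   # extra variation from the rates
c(homogeneous = var(homogeneous) / mean(homogeneous),       # close to 1
  heterogeneous = var(heterogeneous) / mean(heterogeneous)) # well above 1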
Let’s see this in action with a real dataset on bird counts in forest fragments.
# Load bird count data
bird_data <- read_csv("data/bird_counts.csv")
# Take a look at the data
glimpse(bird_data)
Rows: 100
Columns: 5
$ fragment_id <chr> "F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9",…
$ fragment_size <dbl> 15.091298, 39.626952, 21.039869, 44.267853, 47.082897…
$ fragment_age <dbl> 34, 61, 31, 98, 41, 34, 96, 44, 49, 39, 21, 40, 39, 7…
$ connectivity <dbl> 0.5468262, 0.6623176, 0.1716985, 0.6330554, 0.3118697…
$ woodpecker_count <dbl> 5, 3, 20, 6, 55, 0, 25, 11, 2, 5, 9, 19, 6, 20, 0, 37…
# Let's look at the distribution of our count variable
ggplot(bird_data, aes(x = woodpecker_count)) +
geom_histogram(binwidth = 1, fill = "forestgreen", color = "black") +
theme_minimal() +
labs(x = "Number of woodpeckers", y = "Frequency",
title = "Distribution of woodpecker counts")# Calculate mean and variance to check for overdispersion
mean_count <- mean(bird_data$woodpecker_count)
var_count <- var(bird_data$woodpecker_count)
cat("Mean woodpecker count:", mean_count, "\n")Mean woodpecker count: 13.8
cat("Variance of woodpecker count:", var_count, "\n")Variance of woodpecker count: 354.0404
cat("Variance/Mean ratio:", var_count/mean_count, "\n")Variance/Mean ratio: 25.6551
A variance-to-mean ratio well above 1 suggests overdispersion. In our woodpecker data, the ratio is about 25.7, indicating substantial overdispersion that a Poisson model cannot handle properly.
5.2 The Negative Binomial Distribution
The negative binomial distribution provides a flexible alternative to the Poisson for modeling count data when overdispersion is present. Unlike the Poisson, the negative binomial has an additional dispersion parameter (often called θ or k) that allows the variance to exceed the mean: \(\mathrm{Var}(Y) = \mu + \mu^2/\theta\), so smaller values of θ imply more extra-Poisson variation.
Remember how we previously discussed that GLMs extend the linear model by allowing different error distributions? The negative binomial is another distribution in our GLM toolkit, specifically designed for overdispersed count data.
# Function to create comparison data
create_dist_comparison <- function(lambda = 3, size = 1) {
x <- 0:15
poisson_pmf <- dpois(x, lambda)
nb_pmf <- dnbinom(x, size = size, mu = lambda)
data.frame(
x = rep(x, 2),
Probability = c(poisson_pmf, nb_pmf),
Distribution = rep(c("Poisson", "Negative Binomial"), each = length(x))
)
}
# Create data with our observed mean
comparison_data <- create_dist_comparison(lambda = mean_count, size = 1)
# Plot the comparison
ggplot(comparison_data, aes(x = x, y = Probability, fill = Distribution)) +
geom_col(position = "dodge", alpha = 0.7) +
scale_fill_manual(values = c("Negative Binomial" = "darkgreen", "Poisson" = "orange")) +
theme_minimal() +
labs(x = "Count", y = "Probability",
title = "Comparing Poisson and Negative Binomial distributions",
subtitle = paste("Both with mean =", round(mean_count, 2)))You can think of the negative binomial as introducing an extra source of randomness compared to the Poisson. In ecological terms, this can represent:
- Natural heterogeneity in individual organisms
- Unexplained environmental or spatial variation
- “Clumping” or aggregation in populations (e.g., animals that tend to cluster)
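As a quick numerical check of this extra randomness: rnbinom() with the mu argument uses exactly the parameterization above, with size playing the role of θ, so the simulated variance should be close to \(\mu + \mu^2/\theta\) rather than the Poisson variance of \(\mu\). Here θ = 1 is simply the illustrative value used in the plot above.
# Sanity check: negative binomial variance = mu + mu^2 / theta
set.seed(42)
theta <- 1
y <- rnbinom(1e5, size = theta, mu = mean_count)
c(empirical_mean = mean(y),
  empirical_var = var(y),
  theoretical_var = mean_count + mean_count^2 / theta)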
5.3 Fitting Negative Binomial Models in R
Let’s fit both a Poisson and a negative binomial model to our woodpecker data and compare them. We’ll use forest fragment size, age, and connectivity as predictors.
# Fit a Poisson GLM
poisson_model <- glm(woodpecker_count ~ fragment_size + fragment_age + connectivity,
family = poisson,
data = bird_data)
# Fit a Negative Binomial GLM
nb_model <- MASS::glm.nb(woodpecker_count ~ fragment_size + fragment_age + connectivity,
data = bird_data)
# Look at model summaries
summary(poisson_model)
Call:
glm(formula = woodpecker_count ~ fragment_size + fragment_age +
connectivity, family = poisson, data = bird_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.534334 0.113933 4.690 2.73e-06 ***
fragment_size 0.041932 0.002113 19.848 < 2e-16 ***
fragment_age 0.015304 0.001115 13.726 < 2e-16 ***
connectivity -0.134641 0.086911 -1.549 0.121
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1671.0 on 99 degrees of freedom
Residual deviance: 1115.6 on 96 degrees of freedom
AIC: 1491.3
Number of Fisher Scoring iterations: 5
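Before looking at the negative binomial output, a common rule of thumb is to compare the residual deviance (or the Pearson chi-square statistic) to the residual degrees of freedom; for a well-fitting Poisson model these ratios should be close to 1. A minimal sketch:
# Quick dispersion checks for the Poisson fit
# (values near 1 are consistent with the Poisson assumption)
deviance(poisson_model) / df.residual(poisson_model)
sum(residuals(poisson_model, type = "pearson")^2) / df.residual(poisson_model)
Here the deviance-based ratio is roughly 1115.6 / 96 ≈ 11.6, echoing the overdispersion we saw in the raw counts.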
summary(nb_model)
Call:
MASS::glm.nb(formula = woodpecker_count ~ fragment_size + fragment_age +
connectivity, data = bird_data, init.theta = 1.17461343,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.628317 0.359489 1.748 0.080498 .
fragment_size 0.042787 0.007170 5.967 2.41e-09 ***
fragment_age 0.013732 0.003922 3.501 0.000463 ***
connectivity -0.173588 0.323329 -0.537 0.591353
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.1746) family taken to be 1)
Null deviance: 158.30 on 99 degrees of freedom
Residual deviance: 114.72 on 96 degrees of freedom
AIC: 703.36
Number of Fisher Scoring iterations: 1
Theta: 1.175
Std. Err.: 0.186
2 x log-likelihood: -693.359
Poisson Model Coefficients:
- Intercept (0.534): The expected log count of woodpeckers when all predictors equal zero (rarely meaningful on its own)
- fragment_size (0.042): For each 1-hectare increase in fragment size, the log count of woodpeckers increases by about 0.042
- fragment_age (0.015): For each 1-year increase in fragment age, the log count increases by about 0.015
- connectivity (-0.135): For each 1-unit increase in connectivity, the log count decreases by about 0.135, although this effect is not statistically significant (p = 0.12)
To make these more interpretable, we can convert to multiplicative effects using exp():
- For each 1-hectare increase in fragment size: exp(0.042) ≈ 1.043, so the expected woodpecker count increases by approximately 4.3%
- For each 1-year increase in fragment age: exp(0.015) ≈ 1.015, an increase of about 1.5%
- For each 1-unit increase in connectivity: exp(-0.135) ≈ 0.87, a decrease of about 13% (again, not statistically significant)
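Because these effects are multiplicative, they compound over larger changes in a predictor. As a small illustrative calculation using the Poisson estimate for fragment size:
# Expected effect of a 10-hectare increase in fragment size
b_size <- coef(poisson_model)["fragment_size"]
exp(10 * b_size) # about 1.5, i.e. roughly 50% more woodpeckers expected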
Negative Binomial Model Coefficients:
The coefficients have similar interpretations, but the model includes an important additional parameter:
- Theta: This is the dispersion parameter, estimated here as 1.17 (SE 0.19). Smaller values indicate more overdispersion; as θ becomes very large, the negative binomial converges to the Poisson. An estimate this small, and this precisely estimated, is consistent with substantial overdispersion in the counts.
Notice how the standard errors are larger in the negative binomial model compared to the Poisson model. This reflects the more realistic estimation of uncertainty when accounting for overdispersion.
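You can see this directly by placing the two sets of standard errors side by side; a brief sketch:
# Standard errors from both models: the negative binomial's are larger
# because it acknowledges the extra variability in the counts
cbind(poisson = summary(poisson_model)$coefficients[, "Std. Error"],
      negbin = summary(nb_model)$coefficients[, "Std. Error"])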
Now let’s check the fit of both models using diagnostic plots:
# Create DHARMa residuals for the Poisson model
poisson_resids <- simulateResiduals(poisson_model)
# Plot DHARMa diagnostics
plot(poisson_resids)
DHARMa creates simulated residuals that should follow a uniform distribution if the model fits well. Let's look at what these plots tell us about our Poisson model:
QQ plot (left): This shows the quantiles of the standardized residuals against the expected quantiles from a uniform distribution. The red line should follow the dashed line closely if the model fits well. Here, we see significant deviation, especially at the tails, indicating a poor fit.
Residuals vs. Predicted (right): This plot shows residuals against predicted values. We should see a random scatter with no pattern. The red line should be flat. Here, we see a pronounced curve in the red line suggesting model misspecification.
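If you prefer a formal test to reading the QQ panel by eye, DHARMa also provides testUniformity(), which runs a Kolmogorov–Smirnov test of the scaled residuals against a uniform distribution; a brief example:
# Formal test of residual uniformity for the Poisson model
testUniformity(poisson_resids)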
Remember from our previous work with logistic regression that these diagnostic plots help us assess whether our model assumptions are violated. In this case, they clearly show that the Poisson model isn’t appropriate for our data.
# Test for dispersion in a separate plot
par(mfrow = c(1, 1))
testDispersion(poisson_resids)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 18.895, p-value < 2.2e-16
alternative hypothesis: two.sided
The dispersion test formally evaluates whether there is significant over- or underdispersion in our model. For the Poisson model:
Test statistic: The ratio of observed to expected dispersion, with values greater than 1 indicating overdispersion.
P-value: A very small p-value (< 0.001) indicates that we can reject the null hypothesis of correct dispersion. The data shows significant overdispersion that the Poisson model cannot account for.
Histogram: This shows the distribution of dispersion values under the model’s assumptions. The vertical red line represents our observed dispersion, which is clearly outside the expected range for a well-fitting Poisson model.
This test confirms what we suspected from the diagnostic plots - our count data is overdispersed and a Poisson model is not appropriate.
Now, let’s examine the negative binomial model:
# Create DHARMa residuals for the Negative Binomial model
nb_resids <- simulateResiduals(nb_model)
# Plot DHARMa diagnostics
plot(nb_resids)
Now let's examine the diagnostic plots for our negative binomial model:
QQ plot (left): The residuals follow the expected uniform distribution much more closely than in the Poisson model. The red line approximately follows the dashed line, with minor deviations that are within the expected simulation envelope (shaded area).
Residuals vs. Predicted (right): The residuals show a more random scatter around zero with no obvious pattern, and the red line is much flatter than in the Poisson model, indicating a better fit.
These improved diagnostic plots suggest that the negative binomial model is addressing the issues we saw with the Poisson model.
# Test for dispersion
par(mfrow = c(1, 1))
testDispersion(nb_resids)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.0495, p-value = 0.744
alternative hypothesis: two.sided
The dispersion test for our negative binomial model shows:
Test statistic: The ratio is now much closer to 1, indicating appropriate dispersion.
P-value: A large p-value (> 0.05) means we cannot reject the null hypothesis of correct dispersion. This is good news - it indicates the negative binomial model is properly accounting for the overdispersion in our data.
Histogram: The observed dispersion (vertical red line) now falls well within the expected distribution under the negative binomial assumption, confirming the model appropriately handles the overdispersion.
The negative binomial model shows major improvements in all diagnostic measures compared to the Poisson model, suggesting it’s a much more appropriate choice for our overdispersed count data.
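One further optional check, anticipating the next chapter, is whether the data contain more zeros than the fitted model predicts; DHARMa's testZeroInflation() compares the observed number of zeros to the number expected under simulations from the model. A brief example:
# Compare observed zeros to zeros expected under the negative binomial fit
testZeroInflation(nb_resids)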
5.4 Comparing the Models
Let’s formally compare the two models to confirm what the diagnostic plots suggest:
# Compare models using AIC
model_comparison <- compare_performance(poisson_model, nb_model)
model_comparison
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights)
-------------------------------------------------------------------------
poisson_model | glm | 1491.3 (<.001) | 1491.7 (<.001) | 1501.7 (<.001)
nb_model | negbin | 703.4 (>.999) | 704.0 (>.999) | 716.4 (>.999)
Name | Nagelkerke's R2 | RMSE | Sigma | Score_log | Score_spherical
------------------------------------------------------------------------------
poisson_model | 0.996 | 15.961 | 1.000 | -7.416 | 0.064
nb_model | 0.444 | 16.032 | 1.000 | -3.607 | 0.075
# Likelihood ratio test
anova(poisson_model, nb_model, test = "Chisq")
Analysis of Deviance Table
Model 1: woodpecker_count ~ fragment_size + fragment_age + connectivity
Model 2: woodpecker_count ~ fragment_size + fragment_age + connectivity
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 96 1115.60
2 96 114.72 0 1000.9
The model comparison strongly favors the negative binomial model:
AIC comparison: The negative binomial model has a substantially lower AIC (roughly 703 versus 1491, a difference of almost 800 points), indicating a much better fit even after accounting for the extra parameter.
Likelihood ratio test: The Poisson can be viewed as a restricted version of the negative binomial (the limit as θ → ∞), so the two models can be compared with a likelihood-ratio test. Note that the anova() table above does not print a p-value because it treats both models as having the same residual degrees of freedom; the short calculation after this list shows how to obtain the test from the log-likelihoods, and the resulting p-value is vanishingly small, providing strong evidence against the Poisson model.
Theta parameter: The estimated dispersion parameter (θ ≈ 1.17, SE 0.19) is small and precisely estimated, consistent with overdispersion that the Poisson cannot accommodate.
Residual patterns: As we saw in the diagnostic plots, the negative binomial model shows greatly improved residual patterns compared to the Poisson model.
These results collectively demonstrate that for this ecological dataset, the negative binomial model is clearly superior to the Poisson model for modeling woodpecker counts.
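For reference, here is a minimal sketch of the likelihood-ratio calculation mentioned above, computed directly from the two models' log-likelihoods (θ is tested on the boundary of its parameter space, so the 1-df chi-square p-value is, if anything, conservative):
# Likelihood-ratio test of Poisson (theta -> infinity) vs. negative binomial
lr_stat <- as.numeric(2 * (logLik(nb_model) - logLik(poisson_model)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(LR_statistic = lr_stat, p_value = p_value)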
5.5 Interpreting Negative Binomial Models
Interpreting negative binomial models is similar to Poisson models. The coefficients are on the log scale (a short worked example follows the list below), so:
- \(e^{\beta}\) gives the multiplicative effect on the mean count
- Positive coefficients indicate positive associations with count
- Negative coefficients indicate negative associations with count
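A quick way to pull out these multiplicative effects, together with confidence intervals, directly from the fitted model (the numbers should match the coefficient table above after exponentiating):
# Rate ratios and 95% confidence intervals from the negative binomial model
round(exp(cbind(Estimate = coef(nb_model), confint(nb_model))), 3)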
Let’s visualize the effect of fragment size on woodpecker counts:
# Get predicted effects
predictions <- ggpredict(nb_model, terms = "fragment_size [all]")
# Plot
ggplot(predictions, aes(x = x, y = predicted)) +
geom_line(color = "darkgreen", linewidth = 1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, fill = "forestgreen") +
theme_minimal() +
labs(x = "Forest Fragment Size (ha)", y = "Predicted Woodpecker Count",
title = "Effect of Fragment Size on Woodpecker Abundance",
subtitle = "Predictions from negative binomial model with 95% confidence intervals")This plot shows how woodpecker counts are predicted to increase with forest fragment size, while holding other variables (fragment age and connectivity) at their mean values.
The solid green line represents the predicted woodpecker count, and the shaded region shows the 95% confidence interval for this prediction. Because the model uses a log link, each additional hectare multiplies the expected count by exp(0.043) ≈ 1.04, so predicted abundance rises steadily, and increasingly steeply, across the observed range of fragment sizes.
The effect is linear on the log scale, which corresponds to a consistent percentage increase in woodpecker abundance for each additional hectare of forest (rather than a threshold effect). The widening confidence band at larger fragment sizes reflects greater uncertainty in our predictions where we have fewer data points.
This visualization helps ecologists understand and communicate how habitat size affects woodpecker populations in a more intuitive way than looking at model coefficients alone.
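If you want point predictions at particular fragment sizes rather than the full curve, ggpredict() also accepts specific values in its terms argument; a small sketch (the chosen sizes are arbitrary):
# Predicted counts (with 95% CIs) at selected fragment sizes,
# holding fragment_age and connectivity at typical values
ggpredict(nb_model, terms = "fragment_size [10, 20, 40]")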
5.6 Exercise: Analyzing Beetle Count Data
Now it’s your turn! You’ll analyze a dataset on beetle counts from forest traps with various environmental predictors. Your task is to:
- Check for overdispersion in the beetle count data
- Fit both Poisson and negative binomial models
- Compare the models and determine which is more appropriate
- Interpret the effects of the environmental predictors on beetle abundance
# Load beetle count data
beetle_data <- read_csv("data/beetle_counts.csv")
# Explore the data
# [YOUR CODE HERE]
# Check for overdispersion
# [YOUR CODE HERE]
# Fit models
# [YOUR CODE HERE]
# Compare and interpret
# [YOUR CODE HERE]
Start by calculating the mean and variance of the beetle counts. If the variance is much larger than the mean, this suggests overdispersion.
When fitting the models, consider which environmental variables might affect beetle abundance. Temperature, moisture, and canopy cover are often important for insects.
Solution.
# Load beetle count data
beetle_data <- read_csv("data/beetle_counts.csv")
# Explore the data
summary(beetle_data)
# Check for overdispersion
mean_beetles <- mean(beetle_data$beetle_count)
var_beetles <- var(beetle_data$beetle_count)
var_mean_ratio <- var_beetles / mean_beetles
cat("Mean beetle count:", mean_beetles, "\n")
cat("Variance of beetle count:", var_beetles, "\n")
cat("Variance/Mean ratio:", var_mean_ratio, "\n")
# Fit models
poisson_beetle <- glm(beetle_count ~ temperature + moisture + canopy_cover,
family = poisson,
data = beetle_data)
nb_beetle <- MASS::glm.nb(beetle_count ~ temperature + moisture + canopy_cover,
data = beetle_data)
# Compare models
AIC(poisson_beetle, nb_beetle)
anova(poisson_beetle, nb_beetle, test = "Chisq")
# Check model diagnostics
beetle_pois_resids <- simulateResiduals(poisson_beetle)
beetle_nb_resids <- simulateResiduals(nb_beetle)
# Note: DHARMa's plot() draws its own two-panel layout, so we plot each
# model's residuals separately rather than using par(mfrow = ...)
plot(beetle_pois_resids) # Poisson diagnostics
plot(beetle_nb_resids) # Negative binomial diagnostics
# Interpret the winning model
summary(nb_beetle)
# Plot effects
plot(ggpredict(nb_beetle, terms = "temperature [all]"))
5.7 Key Takeaways
- The negative binomial model extends the Poisson to handle overdispersed count data
- Overdispersion is extremely common in ecological data due to natural variability, clustering, and unmeasured factors
- The negative binomial adds a dispersion parameter (θ) that allows the variance to exceed the mean (\(\mathrm{Var}(Y) = \mu + \mu^2/\theta\))
- Smaller values of θ indicate greater overdispersion
- Model selection using AIC and diagnostic plots can help determine if a negative binomial model is needed
- Interpretation of coefficients is similar to Poisson models (on the log scale)
5.8 Looking Ahead
In the next chapter, we’ll explore zero-inflated models, which are useful when your ecological count data has an excess of zeros compared to what even the negative binomial would predict. This is common in surveys where species are difficult to detect or truly absent from some sites.
5.9 Additional Resources
- Zuur, A. F., Ieno, E. N., Walker, N. J., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer.
- Harrison, X. A. (2014). Using observation-level random effects to model overdispersion in count data in ecology and evolution. PeerJ, 2, e616.
- Ecological Models and Data in R by Ben Bolker