Chapter 7: Beta Regression for Proportion Data in Ecology

Author

Steffen Foerster

6.1 Handling Proportions: When Data Are Bounded Between 0 and 1

In previous chapters, we explored Poisson and negative binomial regression for modeling count data, as well as zero-inflated and hurdle models for handling excess zeros. Now, we turn to another common data type in ecological studies: proportions and rates that are bounded between 0 and 1.

Common proportion data in ecology

Proportion data are ubiquitous in ecology and evolutionary biology:

  • The proportion of time animals spend foraging or vigilant
  • Plant cover percentages (transformed to 0-1)
  • Infection rates in populations
  • Germination success rates of seeds
  • Habitat use proportions across environmental gradients
  • Survival rates of seedlings under different conditions
  • Normalized diversity indices bounded between 0 and 1

These variables share a crucial characteristic: they are constrained to lie between 0 and 1. This bounded nature creates challenges for standard regression approaches that require special modeling techniques.

library(betareg)     # for beta regression
library(ggplot2)     # for plotting
library(dplyr)       # for data manipulation
library(tidyr)       # for data reshaping
library(gridExtra)   # for arranging multiple plots

Let’s first see why standard regression approaches can be problematic with proportion data.

# Simulate some ecological proportion data
set.seed(123)
n <- 100
x <- rnorm(n)

# Generate proportion data with a relationship to x
# Note how variance changes with the mean
linear_pred <- 0.3 + 0.2 * x
p <- plogis(linear_pred)  # Convert to probability scale
y <- rbeta(n, p * 10, (1 - p) * 10)  # Generate from beta

prop_data <- data.frame(
  predictor = x,
  proportion = y
)

# Linear model for comparison
lm_model <- lm(proportion ~ predictor, data = prop_data)
pred_lm <- predict(lm_model, newdata = data.frame(predictor = seq(-3, 3, length.out = 100)))

# Create a plotting dataframe
pred_df <- data.frame(
  predictor = seq(-3, 3, length.out = 100),
  predicted = pred_lm
)

# Plot
ggplot(prop_data, aes(x = predictor, y = proportion)) +
  geom_point(alpha = 0.7) +
  geom_line(data = pred_df, aes(x = predictor, y = predicted), color = "blue") +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "red", alpha = 0.7) +
  labs(
    title = "Proportion Data with Linear Model",
    subtitle = "Red lines show bounds (0-1) that linear predictions may exceed",
    x = "Environmental Predictor",
    y = "Proportion Response"
  ) +
  theme_bw()

The problem with linear regression for proportions

Observe in the plot above:

  1. The variance in the data changes with the mean (heteroscedasticity)
  2. The linear regression line (blue) could potentially predict impossible values outside the [0,1] bounds
  3. Near the boundaries, the relationship appears non-linear

These issues violate core assumptions of linear regression and can lead to misleading inferences and predictions.

6.2 The Beta Distribution: A Flexible Model for Proportions

Just as the Poisson and negative binomial distributions are appropriate for count data, the beta distribution is specifically designed for modeling continuous variables bounded between 0 and 1.

In our GLM framework, we’ve seen how different error distributions are appropriate for different types of data: binomial for binary data, Poisson and negative binomial for counts, and now beta for proportions.

The beta distribution is remarkably flexible and can take many different shapes depending on its two shape parameters, which allows it to model various patterns commonly observed in ecological proportion data:

# Function to calculate beta distribution density
beta_density <- function(x, shape1, shape2) {
  dbeta(x, shape1, shape2)
}

# Create a dataframe for plotting
x_values <- seq(0.001, 0.999, length.out = 1000)
plot_data <- expand.grid(
  x = x_values,
  distribution = c("U-shaped (0.5, 0.5)", "Uniform (1, 1)", "Symmetric (5, 5)", 
                   "Right-skewed (2, 5)", "Left-skewed (5, 2)", "Bimodal (0.2, 0.2)")
)

# Calculate densities
plot_data$density <- mapply(
  function(x, dist) {
    params <- switch(
      dist,
      "U-shaped (0.5, 0.5)" = c(0.5, 0.5),
      "Uniform (1, 1)" = c(1, 1),
      "Symmetric (5, 5)" = c(5, 5),
      "Right-skewed (2, 5)" = c(2, 5),
      "Left-skewed (5, 2)" = c(5, 2),
      "Bimodal (0.2, 0.2)" = c(0.2, 0.2)
    )
    beta_density(x, params[1], params[2])
  },
  plot_data$x, plot_data$distribution
)

# Plot
ggplot(plot_data, aes(x = x, y = density)) +
  geom_line() +
  facet_wrap(~ distribution, scales = "free_y") +
  labs(
    title = "Flexibility of the Beta Distribution",
    subtitle = "The beta distribution can model many shapes of proportion data",
    x = "Value (proportion between 0 and 1)",
    y = "Density"
  ) +
  theme_bw()

Different shapes the beta distribution can take
Ecological interpretation of beta distribution shapes

Different shapes of the beta distribution correspond to different ecological scenarios:

  • U-shaped: Cases where values tend to be either very high or very low, such as species that strongly prefer or avoid certain habitats
  • Uniform: When all proportions are equally likely, such as in completely random processes
  • Bell-shaped: For values clustered around a central tendency, like body size ratios or moderate germination rates
  • Right-skewed: When small values are common but large values are rare, such as survival rates in harsh environments
  • Left-skewed: When large values are common but small values are rare, such as success rates in favorable conditions
  • Bimodal: When there are two distinct “types” in your data, such as mixed strategies in behavior

Parameters of the Beta Distribution

Traditionally, the beta distribution has two shape parameters (α and β). In beta regression, it’s more convenient to parameterize it in terms of mean (μ) and precision (φ) where:

  • μ (mean): Determines the central tendency
  • φ (precision): Controls the spread or dispersion (higher values = less dispersion)

The relationship between these parameterizations is: - α = μφ - β = (1-μ)φ

This reparameterization makes it easier to model the mean as a function of predictors while controlling for dispersion.

6.3 Beta Regression Model Structure

Beta regression works by modeling the mean of the beta distribution as a function of predictors using a link function, similar to how we approached other GLMs.

The standard model can be expressed as:

\[ \begin{align} Y &\sim \text{Beta}(\mu\phi, (1-\mu)\phi) \\ \text{logit}(\mu) &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p \end{align} \]

Where: - The logit link function ensures predictions stay between 0 and 1 - β₀, β₁, etc. are the coefficients we estimate - X₁, X₂, etc. are our predictor variables

Similar to the precision parameter in negative binomial regression, the φ parameter in beta regression allows for more flexibility in modeling the variance structure.

6.4 Fitting Beta Regression Models in R

Let’s explore beta regression with a realistic ecological example. Suppose we’re studying the germination success of a plant species across different soil moisture levels and light conditions.

# Simulate ecological data
set.seed(42)
n <- 100

# Predictors: soil moisture and light level (both scaled for interpretation)
moisture <- runif(n, 0, 10)
light <- runif(n, 0, 10)

# Create linear predictor with interaction
linear_pred <- -2 + 0.3 * moisture + 0.2 * light - 0.02 * moisture * light

# Convert to probability scale (0-1)
mu <- plogis(linear_pred)

# Generate beta distributed response (germination success)
# Higher precision (phi=20) for more realistic data
phi <- 20
germination_rate <- rbeta(n, mu * phi, (1 - mu) * phi)

# Create dataframe
seed_data <- data.frame(
  germination_rate = germination_rate,
  moisture = moisture,
  light = light
)

# View the first few rows
head(seed_data)
  germination_rate moisture    light
1        0.5345536 9.148060 6.262453
2        0.5571294 9.370754 2.171577
3        0.1955642 2.861395 2.165673
4        0.7281666 8.304476 3.889450
5        0.6328664 6.417455 9.424557
6        0.6692901 5.190959 9.626080

Let’s visualize our data:

# Create a grid for prediction later
moisture_grid <- seq(min(seed_data$moisture), max(seed_data$moisture), length.out = 50)
light_grid <- c(2, 5, 8)  # Low, medium, and high light levels
grid_data <- expand.grid(moisture = moisture_grid, light = light_grid)

# Plot the data
ggplot(seed_data, aes(x = moisture, y = germination_rate, color = light)) +
  geom_point(alpha = 0.7) +
  scale_color_viridis_c() +
  labs(
    title = "Seed Germination Success vs. Soil Moisture",
    subtitle = "Color indicates light levels (higher values = more light)",
    x = "Soil Moisture",
    y = "Germination Rate"
  ) +
  theme_bw()

Relationship between soil moisture and germination rate, with light levels shown by color

Now, let’s fit a beta regression model to our data:

# Fit a beta regression model with interaction
seed_model <- betareg(germination_rate ~ moisture * light, data = seed_data)

# Model summary
summary(seed_model)

Call:
betareg(formula = germination_rate ~ moisture * light, data = seed_data)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-2.0849 -0.6396 -0.1149  0.5309  2.5712 

Coefficients (mean model with logit link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.330704   0.205278 -11.354  < 2e-16 ***
moisture        0.355177   0.033033  10.752  < 2e-16 ***
light           0.245891   0.032674   7.526 5.24e-14 ***
moisture:light -0.027414   0.005325  -5.148 2.63e-07 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   24.913      3.462   7.196 6.22e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 97.06 on 5 Df
Pseudo R-squared: 0.7373
Number of iterations: 25 (BFGS) + 2 (Fisher scoring) 
Interpreting the beta regression output

Looking at our model summary:

  1. Coefficient estimates: These are on the logit scale, similar to logistic regression.

    • Intercept: The log-odds of germination when moisture and light are both zero
    • moisture: Effect of moisture on log-odds when light is zero
    • light: Effect of light on log-odds when moisture is zero
    • moisture:light: Interaction term showing how the effect of one variable depends on the level of the other
  2. Precision parameter (phi): Estimated at approximately 6.6025455^{10} (on the original scale). Higher values indicate less dispersion in the data.

  3. Significance: The p-values indicate which predictors significantly affect germination rates.

  4. Pseudo R-squared: Value of approximately 0.737 indicates the model explains about 73.7% of the variation in germination rates.

Let’s visualize the predictions from our model:

# Make predictions on our grid
grid_data$predicted <- predict(seed_model, newdata = grid_data, type = "response")

# Convert light to factor for better plotting
grid_data$light_factor <- factor(grid_data$light, 
                                labels = c("Low (2)", "Medium (5)", "High (8)"))

# Plot data with model predictions
ggplot() +
  geom_point(data = seed_data, aes(x = moisture, y = germination_rate, color = light), alpha = 0.5) +
  geom_line(data = grid_data, aes(x = moisture, y = predicted, color = light, group = light_factor), 
            size = 1.5) +
  scale_color_viridis_c("Light Level") +
  labs(
    title = "Beta Regression Model: Germination Success",
    subtitle = "Lines show predicted germination rates at different light levels",
    x = "Soil Moisture",
    y = "Germination Rate"
  ) +
  theme_bw()

Beta regression model predictions showing how germination rates vary with moisture and light
Ecological interpretation of the interaction

The plot reveals an important ecological interaction: the effect of soil moisture on germination depends on light levels.

  • Under low light conditions (darker blue line), increasing soil moisture consistently improves germination rates
  • Under high light conditions (yellow line), the benefit of increasing moisture plateaus more quickly
  • This suggests that light levels potentially modify a plant’s ability to utilize available soil moisture

Such interactions are common in ecology, where multiple environmental factors often work together to influence biological processes. Beta regression allows us to model these complex relationships while respecting the bounded nature of proportion data.

6.5 Model Checking

Model diagnostics for beta regression are not as standardized as for linear models. The plot() function for beta regression objects produces four diagnostic plots that require careful interpretation:

par(mfrow = c(2, 2))
plot(seed_model)

Diagnostic plots for the beta regression model
Interpreting beta regression diagnostics

These four diagnostic plots help assess different aspects of model fit:

  1. Residuals vs. Indices: Shows standardized residuals plotted against observation indices.
    • Look for randomly scattered points around zero
    • Patterns might indicate temporal trends or outliers
  2. Cook’s Distance: Measures the influence of each observation.
    • Points with high values (> 0.5 or 1) may disproportionately affect model estimates
    • Large spikes warrant investigation of those specific observations
  3. Generalized Leverage vs. Predicted Values: Shows how much each observation influences its own predicted value.
    • Points with high leverage and large residuals are particularly concerning
    • Look for observations in the upper corners of the plot
  4. Residuals vs. Linear Predictor: Plots residuals against the linear predictor (logit scale).
    • Systematic patterns suggest model misspecification or inappropriate link function
    • Ideally, points should be randomly scattered around zero

Let’s create some additional diagnostic plots that can be helpful:

# Calculate predicted values on response scale
fitted_values <- predict(seed_model, type = "response")

# Calculate raw residuals
raw_residuals <- seed_data$germination_rate - fitted_values

# Set up plotting area
par(mfrow = c(2, 2))

# Plot 1: Observed vs. Predicted
plot(fitted_values, seed_data$germination_rate, 
     xlab = "Predicted Values", ylab = "Observed Values",
     main = "Observed vs. Predicted Values")
abline(0, 1, col = "red")  # Add line of perfect prediction

# Plot 2: Histogram of residuals
hist(raw_residuals, breaks = 20, 
     main = "Histogram of Raw Residuals", 
     xlab = "Residual")

# Plot 3 & 4: Residuals vs predictors
plot(seed_data$moisture, raw_residuals, 
     xlab = "Moisture", ylab = "Raw Residuals",
     main = "Residuals vs. Moisture")
abline(h = 0, col = "red", lty = 2)

plot(seed_data$light, raw_residuals, 
     xlab = "Light", ylab = "Raw Residuals",
     main = "Residuals vs. Light")
abline(h = 0, col = "red", lty = 2)

Additional diagnostic plots for beta regression

These additional plots help us evaluate:

  1. Observed vs. Predicted: How well the model predictions match the observed data
  2. Residual Distribution: Whether residuals are roughly symmetric around zero
  3. Residuals vs. Predictors: Whether there are patterns in residuals related to specific predictors that might indicate missing terms in the model

6.6 Comparing Beta Regression with Alternative Approaches

Let’s compare our beta regression with two common alternatives: linear regression and logit-transformed linear regression.

# Linear regression (problematic for proportions)
lm_model <- lm(germination_rate ~ moisture * light, data = seed_data)

# Linear regression on logit-transformed response
# Adding a small constant to avoid exact 0s and 1s
seed_data$germ_logit <- log((seed_data$germination_rate + 0.001) / 
                          (1 - seed_data$germination_rate + 0.001))
logit_model <- lm(germ_logit ~ moisture * light, data = seed_data)

# Compare AIC (need to be cautious with different response transformations)
AIC(seed_model)
[1] -184.12
AIC(lm_model)
[1] -179.7055

Let’s visualize predictions from all three models to see how they differ:

# Predictions (using light = 5 for simplicity)
new_data <- data.frame(moisture = seq(0, 10, length.out = 100), light = 5)

# Beta regression predictions
new_data$beta_pred <- predict(seed_model, newdata = new_data, type = "response")

# Linear model predictions
new_data$lm_pred <- predict(lm_model, newdata = new_data)

# Logit-transform predictions (back-transformed)
new_data$logit_pred <- predict(logit_model, newdata = new_data)
new_data$logit_pred <- exp(new_data$logit_pred) / (1 + exp(new_data$logit_pred))

# Convert to long format for plotting
long_data <- pivot_longer(
  new_data, 
  cols = c(beta_pred, lm_pred, logit_pred),
  names_to = "model", 
  values_to = "prediction"
)

# Prettier names
long_data$model <- factor(
  long_data$model,
  levels = c("beta_pred", "lm_pred", "logit_pred"),
  labels = c("Beta Regression", "Linear Regression", "Logit-transformed")
)

# Plot
ggplot() +
  geom_point(data = seed_data, aes(x = moisture, y = germination_rate), alpha = 0.3) +
  geom_line(data = long_data, aes(x = moisture, y = prediction, color = model), 
            size = 1.2) +
  labs(
    title = "Model Comparison for Germination Rate Prediction",
    subtitle = "Light level fixed at 5",
    x = "Soil Moisture",
    y = "Germination Rate",
    color = "Model"
  ) +
  coord_cartesian(ylim = c(0, 1)) +  # Ensure y-axis shows full [0,1] range
  theme_bw()

Comparison of predictions from three different modeling approaches
Advantages of beta regression over alternatives

The comparison reveals several key differences:

  1. Linear regression: Can predict impossible values outside the [0,1] range and assumes constant variance.

  2. Linear regression on logit-transformed response: Keeps predictions bounded but is sensitive to how you handle 0s and 1s. Interpretation is less straightforward since the model operates on the transformed scale.

  3. Beta regression: Naturally respects the bounds, models the changing variance appropriately, and gives directly interpretable results. It handles the curvature in the relationship more naturally.

While all three approaches give similar predictions in this particular example, this isn’t always the case. The differences become more pronounced when: - Data includes values very close to 0 or 1 - The relationship is strongly non-linear - The variance changes substantially across the range of the response

For proportion data in ecology, beta regression typically provides more appropriate and reliable inference than these alternatives.

6.7 Interpreting Beta Regression Coefficients

Beta regression coefficients are on the logit scale, which makes their direct interpretation challenging. Let’s break this down carefully:

# Extract coefficients
coefs <- coef(seed_model)[1:4]  # Excluding phi parameter

# Create a table for interpretation
coef_table <- data.frame(
  Coefficient = names(coefs),
  Estimate = coefs,
  Exponentiated = exp(coefs)
)

coef_table
                  Coefficient    Estimate Exponentiated
(Intercept)       (Intercept) -2.33070397    0.09722728
moisture             moisture  0.35517684    1.42643288
light                   light  0.24589124    1.27876049
moisture:light moisture:light -0.02741352    0.97295882
Understanding Beta Regression Coefficients

In beta regression, we’re modeling the mean of a proportion (μ), which is bound between 0 and 1. The model uses a logit link function:

\(\text{logit}(\mu) = \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots\)

Let’s interpret this step by step:

  1. The coefficients (β) represent changes in the log-odds of the mean proportion.

  2. Exponentiated coefficients (exp(β)) represent multiplicative effects on the odds of the mean proportion.

  3. To clarify what “odds of a proportion” means:

    • If the mean proportion is μ = 0.75, then the odds are 0.75/(1-0.75) = 3 to 1
    • We’re not talking about the odds of a binary outcome, but the odds related to the continuous proportion itself
  4. Example interpretation: For the moisture coefficient of 0.355:

    • Each unit increase in moisture multiplies the odds of the mean germination rate by a factor of exp(0.355) = 1.426
    • If the mean germination rate at moisture = 2 is 0.6, then the odds are 0.6/0.4 = 1.5
    • At moisture = 3, the odds would become 1.5 × 1.426 = 2.14
    • Converting back to a proportion: 2.14/(2.14 + 1) = 0.681
  5. Direction of effects:

    • Positive coefficients (β > 0) or exponentiated values > 1 indicate that increasing the predictor increases the mean proportion
    • Negative coefficients (β < 0) or exponentiated values < 1 indicate that increasing the predictor decreases the mean proportion

This interpretation is mathematically similar to logistic regression, but conceptually different because we’re modeling a continuous proportion rather than the probability of a binary event.

To visualize this effect more concretely, let’s see how the predicted proportion changes with increasing moisture, holding light constant:

# Create sequence of moisture values
moisture_seq <- seq(0, 10, by = 0.1)

# Fixed light value
light_value <- 5

# Create prediction dataframe
pred_df <- data.frame(
  moisture = moisture_seq,
  light = light_value
)

# Get predictions
pred_df$germination <- predict(seed_model, newdata = pred_df, type = "response")

# Calculate odds at each point
pred_df$odds <- pred_df$germination / (1 - pred_df$germination)

# Plot
p1 <- ggplot(pred_df, aes(x = moisture, y = germination)) +
  geom_line(size = 1.2, color = "blue") +
  labs(
    title = "Effect on Germination Rate",
    x = "Moisture",
    y = "Predicted Germination Rate"
  ) +
  theme_bw()

p2 <- ggplot(pred_df, aes(x = moisture, y = odds)) +
  geom_line(size = 1.2, color = "red") +
  labs(
    title = "Effect on Odds",
    x = "Moisture",
    y = "Odds of Germination Rate"
  ) +
  theme_bw()

# Arrange plots
grid.arrange(p1, p2, ncol = 2)

The figure shows how moisture affects both the predicted germination rate (left) and its odds (right). While the relationship with the germination rate itself is curved (bound between 0 and 1), the relationship with the odds shows the multiplicative effect more clearly.

A practical way to think about beta regression coefficients is to focus on their direction and relative magnitude. In our germination example:

  • Positive coefficients for moisture and light tell us that increasing either factor generally increases germination success
  • The negative interaction term suggests that the benefit of increasing one factor diminishes as the other increases
  • The relative magnitudes help identify which factors have stronger effects

6.8 Handling Zero and One Values

6.8.1 The Challenge with Boundary Values

The standard beta distribution is defined on the open interval (0,1), which means it cannot model exact zeros or ones. However, in ecological data, these values often occur naturally:

  • A plant species may be completely absent (0% cover) in some plots
  • All seeds might germinate (100% success) under optimal conditions
  • None of the observed animals might engage in a specific behavior during a sampling period

Let’s look at different strategies for handling these boundary values:

# Create a dataset with some 0s and 1s
set.seed(123)
n <- 100
x <- runif(n, -2, 2)
mu <- plogis(1 + 2*x)
y <- rbeta(n, mu*5, (1-mu)*5)

# Add some exact 0s and 1s
y[1:5] <- 0
y[6:10] <- 1

boundary_data <- data.frame(
  x = x,
  y = y
)

# Visualize
ggplot(boundary_data, aes(x = x, y = y)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Proportion Data with Boundary Values",
    subtitle = "Data contains exact zeros and ones",
    x = "Predictor",
    y = "Proportion"
  ) +
  theme_bw()

6.8.2 Data Transformation Approach

The simplest approach is to slightly adjust zeros and ones to bring them within the (0,1) interval:

# Create a function to adjust values
adjust_prop <- function(y, n) {
  (y * (n - 1) + 0.5) / n
}

# Adjust and visualize
boundary_data$adjusted_y <- adjust_prop(boundary_data$y, n)

# Compare original and adjusted values for boundary cases
head(data.frame(
  original = boundary_data$y[1:12], 
  adjusted = boundary_data$adjusted_y[1:12]
), 12)
    original  adjusted
1  0.0000000 0.0050000
2  0.0000000 0.0050000
3  0.0000000 0.0050000
4  0.0000000 0.0050000
5  0.0000000 0.0050000
6  1.0000000 0.9950000
7  1.0000000 0.9950000
8  1.0000000 0.9950000
9  1.0000000 0.9950000
10 1.0000000 0.9950000
11 0.9984209 0.9934367
12 0.3955323 0.3965769
# Fit beta regression with adjusted data
adj_model <- betareg(adjusted_y ~ x, data = boundary_data)
summary(adj_model)

Call:
betareg(formula = adjusted_y ~ x, data = boundary_data)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-2.4600 -0.3285  0.0272  0.4258  2.5469 

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.4758     0.1192   3.993 6.53e-05 ***
x             0.7986     0.1118   7.140 9.32e-13 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   1.4723     0.1879   7.836 4.65e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 70.12 on 3 Df
Pseudo R-squared: 0.418
Number of iterations: 13 (BFGS) + 1 (Fisher scoring) 

This transformation: - Shrinks values toward 0.5 - Moves zeros to 0.5/n - Moves ones to (n-0.5)/n - Leaves values in the middle relatively unchanged

While simple, this approach can be somewhat arbitrary. The choice of adjustment (0.5 here) affects the results and becomes less impactful with larger sample sizes.

6.8.3 Zero-and-One Inflated Beta Regression

A more sophisticated approach treats zeros and ones as coming from separate processes. The zero-and-one inflated beta regression (ZOIB) model consists of three parts:

  1. A binary process for whether the value equals 0
  2. A binary process for whether the value equals 1 (given that it’s not 0)
  3. A beta regression for values between 0 and 1 (given that the value is neither 0 nor 1)

This can be implemented using the zoib package in R:

# Note: This code is provided for reference but not evaluated
# install.packages("zoib")
library(zoib)

# Fit zero-and-one inflated beta regression
zoib_model <- zoib(
  y ~ x,                # Formula for the beta component
  data = boundary_data,
  zero.inf = ~ x,       # Formula for the zero-inflation component
  one.inf = ~ x,        # Formula for the one-inflation component
  joint = FALSE         # Use separate predictors for each component
)

summary(zoib_model)

The ZOIB model allows different predictors to affect: - The probability of observing a zero - The probability of observing a one - The mean of the beta component for values between 0 and 1 - The precision of the beta component

This approach is particularly useful in ecology when zeros or ones represent qualitatively different states, such as: - Complete absence (0) vs. varying degrees of presence - Complete success (1) vs. varying degrees of partial success

6.8.4 Alternative Approaches

Several other approaches exist for handling boundary values:

  1. Mixture models: Combine discrete distributions for 0 and 1 with a continuous beta distribution

  2. Beta-binomial models: When proportions come from count data, this approach accounts for the “true” underlying rate and sampling variation

  3. Simplex models: Use the simplex distribution, which allows values on the closed interval [0,1]

  4. Logistic regression: If your proportion data comes from counts (successes/failures), a binomial GLM might be more appropriate

  5. Tweedie models: For cases where data includes many zeros and continuous positive values

6.8.5 When to Use Each Approach

Choose your approach based on your research question and data characteristics:

  • Data transformation: Simple, works well when boundary values are rare and not of primary interest
  • ZOIB models: When zeros and ones represent distinct ecological processes worth modeling separately
  • Beta-binomial: When you have the original count data and sample sizes vary across observations
  • Logistic regression: When proportions represent binary outcomes with known denominators

Let’s demonstrate with a practical ecological example:

# Create a dataset mimicking algal cover across a nutrient gradient
set.seed(456)
n <- 100
nutrients <- runif(n, 0, 10)
temperature <- rnorm(n, 25, 5)

# Generate zeros with higher probability at low nutrients
zero_prob <- exp(-0.5 * nutrients) 
is_zero <- runif(n) < zero_prob

# Generate ones with higher probability at high nutrients and moderate temperatures
one_prob <- plogis(-3 + 0.4 * nutrients - 0.01 * (temperature - 25)^2)
is_one <- (runif(n) < one_prob) & (!is_zero)

# Generate beta values for the rest
mu <- plogis(-2 + 0.35 * nutrients - 0.015 * (temperature - 22)^2)
phi <- 5
algae_cover <- rep(NA, n)
algae_cover[is_zero] <- 0
algae_cover[is_one] <- 1
algae_cover[!(is_zero | is_one)] <- rbeta(
  sum(!(is_zero | is_one)), 
  mu[!(is_zero | is_one)] * phi, 
  (1 - mu[!(is_zero | is_one)]) * phi
)

# Create dataframe
algae_data <- data.frame(
  cover = algae_cover,
  nutrients = nutrients,
  temperature = temperature
)

# Visualize
ggplot(algae_data, aes(x = nutrients, y = cover, color = temperature)) +
  geom_point(alpha = 0.7) +
  scale_color_viridis_c() +
  labs(
    title = "Algal Cover Across Nutrient Gradient",
    subtitle = "Note the exact zeros (no algae) and ones (complete coverage)",
    x = "Nutrient Level",
    y = "Proportion Cover"
  ) +
  theme_bw()

# Apply transformation for beta regression
algae_data$adj_cover <- adjust_prop(algae_data$cover, nrow(algae_data))

# Fit model
algae_model <- betareg(adj_cover ~ nutrients + temperature + I(temperature^2), 
                      data = algae_data)

# Summarize
summary(algae_model)

Call:
betareg(formula = adj_cover ~ nutrients + temperature + I(temperature^2), 
    data = algae_data)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-2.1009 -0.5359 -0.1199  0.5224  2.3315 

Coefficients (mean model with logit link):
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -8.272739   2.941977  -2.812  0.00492 ** 
nutrients         0.359697   0.046995   7.654 1.95e-14 ***
temperature       0.578346   0.239788   2.412  0.01587 *  
I(temperature^2) -0.012468   0.004815  -2.590  0.00961 ** 

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   1.2184     0.1554   7.841 4.48e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 97.07 on 5 Df
Pseudo R-squared: 0.4931
Number of iterations: 23 (BFGS) + 3 (Fisher scoring) 

In this algal cover example: - Zeros represent complete absence of algae (common in low-nutrient conditions) - Ones represent complete coverage (common in high-nutrient conditions with moderate temperatures) - Values between 0-1 represent partial cover varying with both nutrients and temperature

A ZOIB model would be ideal here as it could separately model: 1. What factors determine complete absence (zero) 2. What factors lead to complete coverage (one) 3. What factors influence partial coverage levels (between 0-1)

6.9 Variable Precision Models

A powerful feature of beta regression is the ability to model not just the mean but also the precision of the distribution. This is similar to modeling dispersion in negative binomial models.

For example, we might hypothesize that the variability in germination (not just the mean) changes with soil moisture:

# Fit a variable precision model
var_prec_model <- betareg(germination_rate ~ moisture * light | moisture, 
                        data = seed_data)

# Summary of variable precision model
summary(var_prec_model)

Call:
betareg(formula = germination_rate ~ moisture * light | moisture, data = seed_data)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-2.0024 -0.6582 -0.1195  0.5517  2.4033 

Coefficients (mean model with logit link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.333457   0.215180 -10.844  < 2e-16 ***
moisture        0.356768   0.032891  10.847  < 2e-16 ***
light           0.247003   0.034700   7.118 1.09e-12 ***
moisture:light -0.027674   0.005373  -5.151 2.59e-07 ***

Phi coefficients (precision model with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.93156    0.27806  10.543   <2e-16 ***
moisture     0.05685    0.04610   1.233    0.217    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 97.75 on 6 Df
Pseudo R-squared: 0.7373
Number of iterations: 13 (BFGS) + 1 (Fisher scoring) 
Understanding the formula syntax

In the formula germination_rate ~ moisture * light | moisture: - Before the |: predictors for the mean model - After the |: predictors for the precision model

This allows modeling how the variability in the response (not just its central tendency) changes with predictors.

Let’s compare the constant and variable precision models:

# Compare models with AIC
AIC(seed_model)       # Constant precision
[1] -184.12
AIC(var_prec_model)   # Variable precision
[1] -183.4983
# Likelihood ratio test
lmtest::lrtest(seed_model, var_prec_model)
Likelihood ratio test

Model 1: germination_rate ~ moisture * light
Model 2: germination_rate ~ moisture * light | moisture
  #Df LogLik Df  Chisq Pr(>Chisq)
1   5 97.060                     
2   6 97.749  1 1.3782     0.2404

The comparison shows whether allowing the precision to vary with moisture significantly improves model fit. A significant improvement would suggest that the variability in germination itself depends on soil moisture, which has important ecological implications.

6.10 Connecting to Previous Material

Beta regression extends our GLM toolkit in a natural way:

  1. Like logistic regression (which we used for binary outcomes), beta regression uses a logit link function. The difference is that while logistic regression models probabilities of discrete outcomes (0/1), beta regression models continuous proportions.

  2. Like Poisson and negative binomial models, beta regression uses a non-normal error distribution that’s appropriate for the data type. The negative binomial added a dispersion parameter to the Poisson; similarly, the beta distribution has a precision parameter that handles varying dispersion.

  3. Similar to hurdle models, which separate the zero-generating process from the count process, there are zero-one inflated beta models for when you have excess zeros or ones in your proportion data.

6.11 A Real-World Case Study: Species Diversity and Land Use

Let’s explore a slightly more complex ecological example. Suppose we’re studying how habitat fragmentation affects species diversity in forest patches:

# Simulate more realistic ecological data
set.seed(123)
n <- 150

# Predictors: patch size, isolation, and edge ratio
patch_size <- runif(n, 1, 50)  # Hectares
isolation <- runif(n, 0, 15)   # km to nearest patch
edge_ratio <- runif(n, 0.1, 1) # Edge to area ratio

# Create linear predictor with realistic relationships:
# - Larger patches have higher diversity
# - More isolated patches have lower diversity
# - Patches with more edge have lower diversity
linear_pred <- -1 + 0.06 * patch_size - 0.15 * isolation - 1.2 * edge_ratio

# Convert to probability scale (0-1)
mu <- plogis(linear_pred)

# Generate beta distributed response (normalized species diversity)
phi <- 15  # Precision parameter
diversity_index <- rbeta(n, mu * phi, (1 - mu) * phi)

# Create dataframe
diversity_data <- data.frame(
  diversity_index = diversity_index,
  patch_size = patch_size,
  isolation = isolation,
  edge_ratio = edge_ratio
)

# Fit beta regression model
diversity_model <- betareg(diversity_index ~ patch_size + isolation + edge_ratio, 
                         data = diversity_data)

# Summarize the model
summary(diversity_model)

Call:
betareg(formula = diversity_index ~ patch_size + isolation + edge_ratio, 
    data = diversity_data)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-3.8015 -0.6297  0.1984  0.6294  2.2641 

Coefficients (mean model with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.943700   0.166248  -5.676 1.38e-08 ***
patch_size   0.061040   0.003696  16.514  < 2e-16 ***
isolation   -0.138099   0.012490 -11.057  < 2e-16 ***
edge_ratio  -1.512221   0.194933  -7.758 8.65e-15 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   15.350      1.774   8.654   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 167.6 on 5 Df
Pseudo R-squared: 0.6747
Number of iterations: 20 (BFGS) + 2 (Fisher scoring) 

Let’s visualize how isolation affects diversity at different patch sizes:

# Create grid for prediction
patch_levels <- c(5, 15, 30)  # Small, medium, large patches
isolation_grid <- seq(0, 15, length.out = 100)
edge_mean <- mean(diversity_data$edge_ratio)  # Use mean edge ratio for prediction

pred_grid <- expand.grid(
  patch_size = patch_levels,
  isolation = isolation_grid,
  edge_ratio = edge_mean
)

# Make predictions
pred_grid$predicted <- predict(diversity_model, newdata = pred_grid, type = "response")

# Categorize patch size for plotting
pred_grid$size_cat <- factor(
  pred_grid$patch_size,
  levels = c(5, 15, 30),
  labels = c("Small (5 ha)", "Medium (15 ha)", "Large (30 ha)")
)

# Plot the effect of isolation at different patch sizes
ggplot() +
  geom_point(data = diversity_data, aes(x = isolation, y = diversity_index, size = patch_size), 
             alpha = 0.3) +
  geom_line(data = pred_grid, aes(x = isolation, y = predicted, color = size_cat), 
           size = 1.5) +
  scale_color_viridis_d("Patch Size") +
  scale_size_continuous(guide = "none") +
  labs(
    title = "Effect of Isolation on Species Diversity Index",
    subtitle = "Lines show predicted diversity at different patch sizes (edge ratio held at mean)",
    x = "Isolation Distance (km)",
    y = "Species Diversity Index (0-1)"
  ) +
  theme_bw() +
  theme(legend.position = "top")

Effect of isolation on species diversity at different patch sizes

Let’s also visualize the partial effects of each predictor while holding others constant:

# Create prediction grids for each variable
pred_patch <- data.frame(
  patch_size = seq(min(diversity_data$patch_size), max(diversity_data$patch_size), length.out = 100),
  isolation = mean(diversity_data$isolation),
  edge_ratio = mean(diversity_data$edge_ratio)
)

pred_isolation <- data.frame(
  patch_size = mean(diversity_data$patch_size),
  isolation = seq(min(diversity_data$isolation), max(diversity_data$isolation), length.out = 100),
  edge_ratio = mean(diversity_data$edge_ratio)
)

pred_edge <- data.frame(
  patch_size = mean(diversity_data$patch_size),
  isolation = mean(diversity_data$isolation),
  edge_ratio = seq(min(diversity_data$edge_ratio), max(diversity_data$edge_ratio), length.out = 100)
)

# Make predictions
pred_patch$predicted <- predict(diversity_model, newdata = pred_patch, type = "response")
pred_isolation$predicted <- predict(diversity_model, newdata = pred_isolation, type = "response")
pred_edge$predicted <- predict(diversity_model, newdata = pred_edge, type = "response")

# Create plots
p1 <- ggplot() +
  geom_point(data = diversity_data, aes(x = patch_size, y = diversity_index), alpha = 0.3) +
  geom_line(data = pred_patch, aes(x = patch_size, y = predicted), size = 1.5, color = "blue") +
  labs(
    title = "Effect of Patch Size",
    x = "Patch Size (ha)",
    y = "Diversity Index"
  ) +
  theme_bw()

p2 <- ggplot() +
  geom_point(data = diversity_data, aes(x = isolation, y = diversity_index), alpha = 0.3) +
  geom_line(data = pred_isolation, aes(x = isolation, y = predicted), size = 1.5, color = "red") +
  labs(
    title = "Effect of Isolation",
    x = "Isolation (km)",
    y = "Diversity Index"
  ) +
  theme_bw()

p3 <- ggplot() +
  geom_point(data = diversity_data, aes(x = edge_ratio, y = diversity_index), alpha = 0.3) +
  geom_line(data = pred_edge, aes(x = edge_ratio, y = predicted), size = 1.5, color = "green4") +
  labs(
    title = "Effect of Edge Ratio",
    x = "Edge-to-Area Ratio",
    y = "Diversity Index"
  ) +
  theme_bw()

# Arrange plots
grid.arrange(p1, p2, p3, nrow = 2, 
             top = "Partial Effects in Beta Regression Model of Species Diversity")

Partial effects of each predictor on species diversity
Interpreting the diversity model

The model reveals consistent patterns that match theoretical expectations in landscape ecology:

  1. Patch size has a positive effect on diversity (larger patches support more species)
  2. Isolation has a negative effect (more isolated patches have lower diversity)
  3. Edge ratio has a strong negative effect (patches with more edge relative to area have lower diversity)

These patterns align with the theory of island biogeography and habitat fragmentation research, where area, isolation, and edge effects are key drivers of species diversity in fragmented landscapes.

6.12 Model Selection and Averaging with Beta Regression

As with previous models, we can apply model selection to beta regression to identify the most parsimonious model. Let’s create a set of candidate models:

# Create several candidate models
model1 <- betareg(diversity_index ~ patch_size + isolation + edge_ratio, 
                 data = diversity_data)
model2 <- betareg(diversity_index ~ patch_size + isolation, 
                 data = diversity_data)
model3 <- betareg(diversity_index ~ patch_size + edge_ratio, 
                 data = diversity_data)
model4 <- betareg(diversity_index ~ isolation + edge_ratio, 
                 data = diversity_data)
model5 <- betareg(diversity_index ~ patch_size * isolation + edge_ratio, 
                 data = diversity_data)

# Compare models
models <- list(
  full = model1,
  no_edge = model2,
  no_isolation = model3,
  no_patch = model4,
  interaction = model5
)

# Calculate AIC values
aic_values <- sapply(models, AIC)
aic_table <- data.frame(
  Model = names(models),
  AIC = aic_values,
  Delta_AIC = aic_values - min(aic_values),
  Weight = exp(-0.5 * (aic_values - min(aic_values))) / 
           sum(exp(-0.5 * (aic_values - min(aic_values))))
)

# Sort by AIC
aic_table <- aic_table[order(aic_table$AIC), ]
aic_table
                    Model       AIC  Delta_AIC       Weight
full                 full -325.1890   0.000000 7.294258e-01
interaction   interaction -323.2056   1.983423 2.705742e-01
no_edge           no_edge -275.7637  49.425290 1.350255e-11
no_isolation no_isolation -236.4431  88.745892 3.908910e-20
no_patch         no_patch -161.2035 163.985550 1.794615e-36
Model selection for beta regression

The table above shows AIC values, delta-AIC, and AIC weights for our candidate models. The model with the lowest AIC provides the best balance of model fit and complexity.

In practice, you might consider:

  1. Models with delta-AIC < 2 are considered to have substantial support
  2. AIC weights represent the relative likelihood of each model
  3. When multiple models have similar support, model averaging can provide more robust inference

For beta regression, the same principles of model selection apply as with our previous GLMs like Poisson and negative binomial.

6.13 Key Takeaways

  • Beta regression is specifically designed for modeling continuous proportions bounded between 0 and 1
  • The beta distribution is highly flexible and can model many different patterns in proportion data
  • Beta regression naturally accommodates the changing variance structure common in proportion data
  • The logit link function ensures predictions remain within the valid range (0-1)
  • Variable precision models allow modeling both the mean and the dispersion of proportions
  • Diagnostics for beta regression require careful interpretation and are not as standardized as for linear models
  • Beta regression typically provides more appropriate inference than linear models or transformed approaches

6.14 Looking Ahead

In the next chapter, we’ll explore multinomial regression models, which extend our GLM framework to categorical response variables with more than two levels. This will allow us to model outcomes like habitat selection among multiple habitat types or behavioral states with multiple categories.

6.15 Exercise: Analyzing Pollination Success Rates

Now it’s your turn! In this exercise, you’ll analyze a dataset on pollination success rates across different plant communities.

# Load the pollination data
pollination_data <- read.csv("data/pollination_success.csv")

# Explore the data structure
# [YOUR CODE HERE]

# Check for zeros and ones
# [YOUR CODE HERE]

# Fit a beta regression model
# [YOUR CODE HERE]

# Interpret coefficients and assess model fit
# [YOUR CODE HERE]

# Create visualizations of the model predictions
# [YOUR CODE HERE]

Start by exploring the data structure and checking for zeros and ones. If present, consider using the adjustment technique we covered to handle these boundary values.

Think about which covariates might affect pollination success. Consider both the mean model and whether variable precision is needed.

Remember to transform coefficients to odds ratios for easier ecological interpretation.

Solution.

# Load the pollination data (this would be a real dataset in practice)
# For this example, let's create simulated data
set.seed(456)
n <- 120
floral_density <- runif(n, 1, 10)
pollinator_diversity <- runif(n, 0, 15)
temperature <- rnorm(n, 25, 5)
plant_community <- factor(sample(c("Forest", "Meadow", "Shrubland"), n, replace = TRUE))

# Create response variable (pollination success rate)
linear_pred <- -1.5 + 0.3 * floral_density + 0.1 * pollinator_diversity - 0.05 * temperature
linear_pred <- linear_pred + ifelse(plant_community == "Meadow", 0.5, 
                                   ifelse(plant_community == "Shrubland", 0.2, 0))
mu <- plogis(linear_pred)
phi <- 12
poll_success <- rbeta(n, mu * phi, (1 - mu) * phi)

# Create the dataset
pollination_data <- data.frame(
  pollination_success = poll_success,
  floral_density = floral_density,
  pollinator_diversity = pollinator_diversity,
  temperature = temperature,
  plant_community = plant_community
)

# Explore the data structure
str(pollination_data)
summary(pollination_data)

# Visualize the distribution of the response
hist(pollination_data$pollination_success, breaks = 20,
     main = "Distribution of Pollination Success Rates",
     xlab = "Pollination Success Rate")

# Check for zeros and ones
sum(pollination_data$pollination_success == 0)
sum(pollination_data$pollination_success == 1)

# Handle zeros and ones if necessary
# If zeros or ones are present:
n <- nrow(pollination_data)
pollination_data$adjusted_success <- (pollination_data$pollination_success * (n-1) + 0.5) / n

# Fit a beta regression model
poll_model <- betareg(pollination_success ~ floral_density + pollinator_diversity + 
                     temperature + plant_community, 
                     data = pollination_data)

# Summary of the model
summary(poll_model)

# Try a variable precision model
var_prec_poll <- betareg(pollination_success ~ floral_density + pollinator_diversity + 
                        temperature + plant_community | temperature, 
                        data = pollination_data)

# Compare models
AIC(poll_model, var_prec_poll)

# Calculate odds ratios for interpretation
exp(coef(poll_model)[1:6])

# Visualize effects (example for floral density)
new_data <- expand.grid(
  floral_density = seq(min(pollination_data$floral_density), 
                       max(pollination_data$floral_density), 
                       length.out = 100),
  pollinator_diversity = mean(pollination_data$pollinator_diversity),
  temperature = mean(pollination_data$temperature),
  plant_community = levels(pollination_data$plant_community)
)

new_data$predicted <- predict(poll_model, newdata = new_data, type = "response")

# Plot
ggplot() +
  geom_point(data = pollination_data, 
             aes(x = floral_density, y = pollination_success, color = plant_community),
             alpha = 0.5) +
  geom_line(data = new_data, 
            aes(x = floral_density, y = predicted, color = plant_community),
            size = 1.2) +
  labs(
    title = "Effect of Floral Density on Pollination Success",
    subtitle = "By plant community type",
    x = "Floral Density",
    y = "Pollination Success Rate"
  ) +
  theme_bw()

6.16 Additional Resources

  1. Ferrari, S. L. P., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799-815.

  2. Cribari-Neto, F., & Zeileis, A. (2010). Beta regression in R. Journal of Statistical Software, 34(2), 1-24.

  3. Espinheira, P. L., Ferrari, S. L. P., & Cribari-Neto, F. (2008). On beta regression residuals. Journal of Applied Statistics, 35(4), 407-419.

  4. Simas, A. B., Barreto-Souza, W., & Rocha, A. V. (2010). Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis, 54(2), 348-366.

  5. Zeileis, A., Cribari-Neto, F., & Grün, B. (2023). betareg: Beta Regression. R package version 3.1-4.

  6. Smithson, M., & Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54-71.