Chapter 3: Logistic Regression for Binary Ecological Data

Author

Steffen Foerster

3.1 From Continuous to Binary: When Your Response is Yes or No

In introductory statistics, we learned about linear regression for continuous response variables - things we can measure on a scale, like height, weight, or temperature. But ecological research often involves questions where the response is binary: present or absent, alive or dead, infected or healthy, male or female.

Common binary responses in ecology and evolution

Binary (yes/no) data are everywhere in ecological and evolutionary studies:

  • Species presence/absence in surveys
  • Survival or mortality of organisms
  • Infection status (infected vs. healthy)
  • Breeding success (bred vs. did not breed)
  • Sex determination (male vs. female)
  • Behavioral choices (displaying vs. not displaying)
  • Habitat use (used vs. not used)

In Chapter 2, we learned about the GLM framework and how different distributions and link functions allow us to model various types of data. Now we’ll apply this framework to binary data using the binomial family with a logit link.

Let’s start by seeing why we can’t just use linear regression for binary data.

library(tidyverse)    # for data manipulation and visualization
library(broom)        # for tidying model outputs
library(performance)  # for model checking
library(pROC)         # for ROC curves
library(effects)      # for visualizing effects
library(car)          # for additional diagnostics

3.2 The Problem with Linear Regression for Binary Data

Let’s explore why linear regression fails for binary responses. We’ll use a realistic ecological example: the presence or absence of a rare salamander species across different forest sites.

3.2.1 The Dataset

We’ll analyze data from a field survey of rare salamander populations across forest fragments in the Pacific Northwest. Researchers surveyed 200 forest sites to understand which environmental factors predict salamander presence or absence.

Dataset: salamander_presence.csv

Variables: - presence: Salamander detected (1) or not detected (0) during survey - canopy_cover: Percent canopy cover at the site (0-100%) - soil_moisture: Soil moisture index measured with a probe (unitless, higher = wetter) - leaf_litter: Depth of leaf litter layer in centimeters

Study context: Salamanders are moisture-dependent amphibians that require specific microhabitat conditions. Understanding their habitat associations is critical for conservation planning in fragmented forests.

3.2.2 Loading and Exploring the Data

Let’s start by loading and examining our salamander survey data:

# Load the data
salamander_data <- read.csv("data/salamander_presence.csv")

# Examine structure
str(salamander_data)
'data.frame':   200 obs. of  5 variables:
 $ presence       : int  0 1 0 1 1 1 0 1 0 1 ...
 $ canopy_cover   : num  28.8 78.8 40.9 88.3 94 ...
 $ soil_moisture  : num  39.3 53.9 46.3 44.8 35.7 ...
 $ leaf_litter    : num  4.74 13.73 4.52 6.37 3.48 ...
 $ presence_factor: chr  "Absent" "Present" "Absent" "Present" ...
# Summary statistics
summary(salamander_data)
    presence      canopy_cover      soil_moisture    leaf_litter      
 Min.   :0.000   Min.   : 0.06248   Min.   :19.20   Min.   : 0.02383  
 1st Qu.:0.000   1st Qu.:27.21181   1st Qu.:40.72   1st Qu.: 4.92725  
 Median :1.000   Median :48.20961   Median :48.80   Median : 9.84129  
 Mean   :0.635   Mean   :50.63919   Mean   :50.10   Mean   : 9.87122  
 3rd Qu.:1.000   3rd Qu.:73.33708   3rd Qu.:59.16   3rd Qu.:15.10400  
 Max.   :1.000   Max.   :99.42698   Max.   :98.62   Max.   :19.89873  
 presence_factor   
 Length:200        
 Class :character  
 Mode  :character  
                   
                   
                   
# Check for missing values
cat("\nMissing values:\n")

Missing values:
colSums(is.na(salamander_data))
       presence    canopy_cover   soil_moisture     leaf_litter presence_factor 
              0               0               0               0               0 
# Presence/absence counts
cat("\nSalamander detection:\n")

Salamander detection:
table(salamander_data$presence)

  0   1 
 73 127 
cat("\nProportion present:", round(mean(salamander_data$presence), 3), "\n")

Proportion present: 0.635 
# Convert presence to factor for better plotting
salamander_data$presence_factor <- factor(salamander_data$presence, 
                                          levels = c(0, 1), 
                                          labels = c("Absent", "Present"))

# Create exploratory visualizations
p1 <- ggplot(salamander_data, aes(x = presence_factor, y = canopy_cover, 
                                   fill = presence_factor)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Absent" = "coral", "Present" = "cyan4")) +
  labs(title = "Canopy Cover by Presence", 
       x = "Salamander Status", 
       y = "Canopy Cover (%)") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(salamander_data, aes(x = presence_factor, y = soil_moisture, 
                                   fill = presence_factor)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Absent" = "coral", "Present" = "cyan4")) +
  labs(title = "Soil Moisture by Presence", 
       x = "Salamander Status", 
       y = "Soil Moisture Index") +
  theme_minimal() +
  theme(legend.position = "none")

p3 <- ggplot(salamander_data, aes(x = presence_factor, y = leaf_litter, 
                                   fill = presence_factor)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Absent" = "coral", "Present" = "cyan4")) +
  labs(title = "Leaf Litter by Presence", 
       x = "Salamander Status", 
       y = "Litter Depth (cm)") +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(salamander_data, aes(x = canopy_cover, y = soil_moisture, 
                                   color = presence_factor)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("Absent" = "coral", "Present" = "cyan4")) +
  labs(title = "Bivariate Pattern", 
       x = "Canopy Cover (%)", 
       y = "Soil Moisture Index",
       color = "Status") +
  theme_minimal()

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)

Exploratory analysis of salamander presence patterns
Initial observations from data exploration

From our exploration:

Sample characteristics: - 200 forest sites surveyed - Salamanders detected at sites representing a reasonable detection rate - Balance between presence/absence is adequate for modeling - No missing data - clean dataset

Environmental patterns: - Sites with salamanders tend to have higher canopy cover - Soil moisture appears higher at presence sites - Leaf litter depth is notably greater where salamanders are present - The bivariate plot suggests sites with both high canopy and high moisture are more likely to have salamanders

These patterns make biological sense - salamanders require moist, shaded microhabitats with protective cover. Our modeling will quantify these relationships!

3.2.3 Why Linear Regression Fails

Now let’s see what goes wrong when we try linear regression with this binary data:

# Fit a linear regression (wrong approach)
linear_model <- lm(presence ~ canopy_cover, data = salamander_data)

# Create prediction grid
pred_grid <- data.frame(canopy_cover = seq(0, 100, length.out = 100))
pred_grid$linear_pred <- predict(linear_model, newdata = pred_grid)

# Plot the problem
ggplot(salamander_data, aes(x = canopy_cover, y = presence)) +
  geom_point(alpha = 0.5, position = position_jitter(height = 0.02)) +
  geom_line(data = pred_grid, aes(y = linear_pred), color = "red", size = 1.2) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "gray50") +
  labs(
    title = "Linear Regression Applied to Binary Data (WRONG!)",
    subtitle = "Red line shows predictions that can fall outside [0,1]",
    x = "Canopy Cover (%)",
    y = "Salamander Presence (0 = Absent, 1 = Present)"
  ) +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Why linear regression fails for binary data
Three critical problems with linear regression for binary data
  1. Predictions outside [0,1]: The red line shows that linear regression can predict values less than 0 or greater than 1, which makes no sense for a probability.

  2. Non-constant variance: The variance of binary data depends on the mean. When the probability is near 0.5, there’s maximum variance; when it’s near 0 or 1, there’s minimal variance. Linear regression assumes constant variance (homoscedasticity).

  3. Non-linear relationship: The relationship between predictors and probability is inherently non-linear. As we move from low to high probability, equal changes in the predictor have diminishing effects on probability.

These aren’t minor technical issues - they fundamentally violate the assumptions of linear regression and can lead to completely wrong conclusions.

Let’s check the residuals from this linear model:

# Create diagnostic plots
par(mfrow = c(2, 2))
plot(linear_model)

Diagnostic plots showing violations of linear regression assumptions
par(mfrow = c(1, 1))

The diagnostic plots reveal serious problems: the residuals show clear patterns, the Q-Q plot deviates substantially from normality, and the scale-location plot shows obvious heteroscedasticity.

3.3 Enter Logistic Regression: A Better Approach

Logistic regression solves all three problems by:

  1. Using a link function (the logit function) that constrains predictions to [0,1]
  2. Using a binomial distribution that naturally accounts for non-constant variance
  3. Modeling a non-linear relationship between predictors and probability

This is our first example of a Generalized Linear Model (GLM). GLMs extend linear models by allowing different distributions (not just normal) and using link functions to connect the linear predictor to the response.

The logistic regression model can be expressed as:

\[ \begin{align} Y_i &\sim \text{Bernoulli}(p_i) \\ \text{logit}(p_i) &= \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots \end{align} \]

Where: - \(Y_i\) is the binary outcome (0 or 1) for observation \(i\) - \(p_i\) is the probability of success (presence, survival, etc.) for observation \(i\) - The logit function transforms probabilities [0,1] to log-odds (-∞,∞) - \(\beta_0, \beta_1, \ldots\) are coefficients we estimate from the data

Let’s fit a proper logistic regression to our salamander data:

# Fit logistic regression model
salamander_model <- glm(presence ~ canopy_cover + soil_moisture + leaf_litter,
                       family = binomial(link = "logit"),
                       data = salamander_data)

# View summary
summary(salamander_model)

Call:
glm(formula = presence ~ canopy_cover + soil_moisture + leaf_litter, 
    family = binomial(link = "logit"), data = salamander_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -7.24193    1.22519  -5.911 3.40e-09 ***
canopy_cover   0.07017    0.01109   6.330 2.46e-10 ***
soil_moisture  0.05721    0.01515   3.777 0.000159 ***
leaf_litter    0.19278    0.04141   4.655 3.23e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 262.50  on 199  degrees of freedom
Residual deviance: 161.78  on 196  degrees of freedom
AIC: 169.78

Number of Fisher Scoring iterations: 6

Now let’s visualize how logistic regression properly models the relationship:

# Create prediction grid (holding other variables at their means)
pred_grid_logistic <- data.frame(
  canopy_cover = seq(0, 100, length.out = 100),
  soil_moisture = mean(salamander_data$soil_moisture),
  leaf_litter = mean(salamander_data$leaf_litter)
)

# Get predictions
pred_grid_logistic$probability <- predict(salamander_model, 
                                         newdata = pred_grid_logistic, 
                                         type = "response")

# Plot
ggplot(salamander_data, aes(x = canopy_cover, y = presence)) +
  geom_point(alpha = 0.5, position = position_jitter(height = 0.02)) +
  geom_line(data = pred_grid_logistic, aes(y = probability), 
            color = "blue", size = 1.2) +
  labs(
    title = "Logistic Regression for Binary Data (CORRECT!)",
    subtitle = "Blue curve shows predicted probabilities that stay within [0,1]",
    x = "Canopy Cover (%)",
    y = "Probability of Salamander Presence"
  ) +
  theme_minimal()

Logistic regression properly constrains predictions to [0,1]
The S-shaped logistic curve

Notice the characteristic S-shaped (sigmoid) curve. This shape makes ecological sense:

  • At very low canopy cover, the probability of salamander presence is near zero and changes slowly
  • In the middle range, probability increases rapidly with canopy cover
  • At very high canopy cover, probability approaches 1 and changes slowly again

This non-linear relationship captures the reality that species often have threshold responses to environmental conditions - small changes in marginal habitat have little effect, while changes in intermediate habitat have large effects.

3.5 Odds, Odds Ratios, and Why They Matter

One of the most challenging aspects of logistic regression is understanding odds and odds ratios. These are fundamentally different from probabilities, but they’re crucial for interpreting logistic regression coefficients.

3.5.1 From Probability to Odds

Probability is what we’re most familiar with: the chance of an event occurring, ranging from 0 to 1 (or 0% to 100%).

Odds express the same information differently: the ratio of the probability of success to the probability of failure.

\[\text{Odds} = \frac{p}{1-p}\]

Let’s see some examples:

# Create a table of probability and odds conversions
prob_values <- c(0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)
odds_values <- prob_values / (1 - prob_values)

comparison_df <- data.frame(
  Probability = prob_values,
  Odds = round(odds_values, 3),
  Interpretation = c(
    "1 in 10 chance",
    "1 in 4 chance",
    "50-50 chance",
    "3 in 4 chance",
    "9 in 10 chance",
    "19 in 20 chance",
    "99 in 100 chance"
  ),
  Odds_Interpretation = c(
    "1 to 9 odds",
    "1 to 3 odds",
    "1 to 1 odds (even)",
    "3 to 1 odds",
    "9 to 1 odds",
    "19 to 1 odds",
    "99 to 1 odds"
  )
)

print(comparison_df)
  Probability   Odds   Interpretation Odds_Interpretation
1        0.10  0.111   1 in 10 chance         1 to 9 odds
2        0.25  0.333    1 in 4 chance         1 to 3 odds
3        0.50  1.000     50-50 chance  1 to 1 odds (even)
4        0.75  3.000    3 in 4 chance         3 to 1 odds
5        0.90  9.000   9 in 10 chance         9 to 1 odds
6        0.95 19.000  19 in 20 chance        19 to 1 odds
7        0.99 99.000 99 in 100 chance        99 to 1 odds
Understanding odds vs. probability

Key differences:

  • Probability ranges from 0 to 1
  • Odds range from 0 to ∞
  • When probability = 0.5, odds = 1 (even odds)
  • When probability < 0.5, odds < 1
  • When probability > 0.5, odds > 1

Example: If a salamander has a 0.75 probability of being present at a site: - We might say “there’s a 75% chance of finding it” - In odds terms: “the odds are 3 to 1 in favor of finding it” (0.75/0.25 = 3)

3.5.2 Odds Ratios: The Heart of Logistic Regression

When we exponentiate logistic regression coefficients, we get odds ratios (ORs). An odds ratio tells us how the odds of success change when we increase a predictor by one unit.

# Extract coefficients
coefs <- coef(salamander_model)

# Calculate odds ratios
odds_ratios <- exp(coefs)

# Create interpretation table
or_table <- data.frame(
  Variable = names(coefs),
  Coefficient = round(coefs, 4),
  Odds_Ratio = round(odds_ratios, 4),
  Interpretation = c(
    "Baseline log-odds when all predictors = 0",
    "OR for 1% increase in canopy cover",
    "OR for 1-unit increase in soil moisture",
    "OR for 1 cm increase in leaf litter"
  )
)

print(or_table)
                   Variable Coefficient Odds_Ratio
(Intercept)     (Intercept)     -7.2419     0.0007
canopy_cover   canopy_cover      0.0702     1.0727
soil_moisture soil_moisture      0.0572     1.0589
leaf_litter     leaf_litter      0.1928     1.2126
                                         Interpretation
(Intercept)   Baseline log-odds when all predictors = 0
canopy_cover         OR for 1% increase in canopy cover
soil_moisture   OR for 1-unit increase in soil moisture
leaf_litter         OR for 1 cm increase in leaf litter
Interpreting odds ratios

For our salamander model:

Canopy cover (OR ≈ 1.073): - For each 1% increase in canopy cover, the odds of salamander presence increase by a factor of 1.073 - Or: the odds increase by about 7.3% per 1% increase in canopy cover - A 10% increase in canopy cover multiplies the odds by 2.017

Leaf litter (OR ≈ 1.213): - For each 1 cm increase in leaf litter depth, the odds of presence increase by a factor of 1.213 - This is a 21.3% increase in odds per cm

Key insight: Odds ratios > 1 indicate a positive effect, odds ratios < 1 indicate a negative effect, and an odds ratio = 1 indicates no effect.

3.5.3 Odds Ratios vs. Relative Risk

Students often confuse odds ratios with relative risk (also called risk ratios). Let’s clarify the difference:

Relative Risk (RR) = Probability in exposed group / Probability in unexposed group

Odds Ratio (OR) = (Odds in exposed group) / (Odds in unexposed group)

# Create comparison data
baseline_prob <- seq(0.01, 0.9, by = 0.01)
exposed_prob <- pmin(baseline_prob * 2, 0.99)  # Double the risk, capped at 0.99

# Calculate OR and RR
baseline_odds <- baseline_prob / (1 - baseline_prob)
exposed_odds <- exposed_prob / (1 - exposed_prob)
or_values <- exposed_odds / baseline_odds
rr_values <- exposed_prob / baseline_prob

or_rr_df <- data.frame(
  baseline_prob = baseline_prob,
  odds_ratio = or_values,
  relative_risk = rr_values
)

# Plot
ggplot(or_rr_df, aes(x = baseline_prob)) +
  geom_line(aes(y = odds_ratio, color = "Odds Ratio"), size = 1.2) +
  geom_line(aes(y = relative_risk, color = "Relative Risk"), size = 1.2) +
  scale_color_manual(values = c("Odds Ratio" = "blue", "Relative Risk" = "red")) +
  labs(
    title = "Odds Ratio vs. Relative Risk",
    subtitle = "When exposure doubles the probability of the outcome",
    x = "Baseline Probability",
    y = "Effect Measure",
    color = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "top")

Odds ratios vs. relative risk at different baseline probabilities
When odds ratios approximate relative risk

Key points about OR vs. RR:

  1. When the outcome is rare (probability < 0.1), odds ratios closely approximate relative risk
  2. When the outcome is common (probability > 0.3), odds ratios can substantially overestimate relative risk
  3. Odds ratios are always further from 1 than relative risk (more extreme)
  4. In logistic regression, we estimate odds ratios, not relative risk

Example: If baseline probability is 0.05 (5%) and exposure doubles it to 0.10 (10%): - Relative risk = 0.10 / 0.05 = 2.0 - Odds ratio = (0.10/0.90) / (0.05/0.95) ≈ 2.11

But if baseline probability is 0.40 (40%) and exposure increases it to 0.60 (60%): - Relative risk = 0.60 / 0.40 = 1.5 - Odds ratio = (0.60/0.40) / (0.40/0.60) = 2.25 (much larger!)

For ecological data where presence/absence is often relatively rare, odds ratios usually provide reasonable approximations of relative risk.

3.5.4 Converting Between Odds, Probability, and Log-Odds

Let’s create a comprehensive visualization showing the relationships:

# Create conversion data
prob_seq <- seq(0.01, 0.99, length.out = 200)
odds_seq <- prob_seq / (1 - prob_seq)
logodds_seq <- log(odds_seq)

conversion_df <- data.frame(
  probability = prob_seq,
  odds = odds_seq,
  log_odds = logodds_seq
)

# Create three panels
p1 <- ggplot(conversion_df, aes(x = probability, y = odds)) +
  geom_line(color = "darkblue", size = 1.2) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "gray50") +
  labs(
    title = "A) Probability to Odds",
    x = "Probability",
    y = "Odds"
  ) +
  theme_minimal()

p2 <- ggplot(conversion_df, aes(x = probability, y = log_odds)) +
  geom_line(color = "darkgreen", size = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "gray50") +
  labs(
    title = "B) Probability to Log-Odds",
    x = "Probability",
    y = "Log-Odds (Logit)"
  ) +
  theme_minimal()

p3 <- ggplot(conversion_df, aes(x = odds, y = log_odds)) +
  geom_line(color = "darkred", size = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 1, linetype = "dashed", color = "gray50") +
  labs(
    title = "C) Odds to Log-Odds",
    x = "Odds",
    y = "Log-Odds"
  ) +
  xlim(0, 10) +
  theme_minimal()

# Arrange plots
gridExtra::grid.arrange(p1, p2, p3, ncol = 2)
Warning: Removed 17 rows containing missing values (`geom_line()`).

Relationships between probability, odds, and log-odds
Quick reference for conversions

From probability (p) to odds: \(\text{Odds} = \frac{p}{1-p}\)

From odds to probability: \(p = \frac{\text{Odds}}{1 + \text{Odds}}\)

From probability to log-odds: \(\text{Log-odds} = \log\left(\frac{p}{1-p}\right)\)

From log-odds to probability: \(p = \frac{e^{\text{log-odds}}}{1 + e^{\text{log-odds}}} = \frac{1}{1 + e^{-\text{log-odds}}}\)

In R: - plogis(x) converts log-odds to probability - qlogis(p) converts probability to log-odds

3.6 Interpreting Logistic Regression Coefficients

Now that we understand odds and odds ratios, let’s dive deeper into interpreting our salamander model:

# Get confidence intervals for coefficients
conf_ints <- confint(salamander_model)
Waiting for profiling to be done...
# Create detailed interpretation table
detailed_table <- data.frame(
  Variable = names(coef(salamander_model)),
  Coefficient = round(coef(salamander_model), 4),
  SE = round(summary(salamander_model)$coefficients[, "Std. Error"], 4),
  OR = round(exp(coef(salamander_model)), 4),
  OR_Lower = round(exp(conf_ints[, 1]), 4),
  OR_Upper = round(exp(conf_ints[, 2]), 4),
  p_value = round(summary(salamander_model)$coefficients[, "Pr(>|z|)"], 4)
)

print(detailed_table)
                   Variable Coefficient     SE     OR OR_Lower OR_Upper p_value
(Intercept)     (Intercept)     -7.2419 1.2252 0.0007   0.0001   0.0066   0e+00
canopy_cover   canopy_cover      0.0702 0.0111 1.0727   1.0515   1.0985   0e+00
soil_moisture soil_moisture      0.0572 0.0151 1.0589   1.0293   1.0926   2e-04
leaf_litter     leaf_litter      0.1928 0.0414 1.2126   1.1234   1.3228   0e+00
Step-by-step interpretation guide

Let’s interpret the canopy_cover coefficient as an example:

1. On the log-odds scale (coefficient = 0.0702): - For each 1% increase in canopy cover, the log-odds of salamander presence increase by 0.0702 - This is the scale of the actual model estimation, but hard to interpret intuitively

2. On the odds scale (OR = 1.0727): - For each 1% increase in canopy cover, the odds of presence are multiplied by 1.0727 - This is a 7.27% increase in odds - For a 10% increase: odds are multiplied by 2.017

3. On the probability scale (requires baseline): - The effect on probability depends on where you start - We’ll visualize this with effect plots below

3.6.1 Visualizing Effects on the Probability Scale

While odds ratios are what we estimate, probabilities are often more intuitive for interpretation and presentation:

# Create prediction grids for each variable
grid_canopy <- data.frame(
  canopy_cover = seq(0, 100, length.out = 100),
  soil_moisture = mean(salamander_data$soil_moisture),
  leaf_litter = mean(salamander_data$leaf_litter)
)

grid_moisture <- data.frame(
  canopy_cover = mean(salamander_data$canopy_cover),
  soil_moisture = seq(min(salamander_data$soil_moisture), 
                     max(salamander_data$soil_moisture), 
                     length.out = 100),
  leaf_litter = mean(salamander_data$leaf_litter)
)

grid_litter <- data.frame(
  canopy_cover = mean(salamander_data$canopy_cover),
  soil_moisture = mean(salamander_data$soil_moisture),
  leaf_litter = seq(0, 20, length.out = 100)
)

# Get predictions with confidence intervals
get_predictions <- function(model, newdata) {
  pred <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
  newdata$prob <- plogis(pred$fit)
  newdata$lower <- plogis(pred$fit - 1.96 * pred$se.fit)
  newdata$upper <- plogis(pred$fit + 1.96 * pred$se.fit)
  return(newdata)
}

grid_canopy <- get_predictions(salamander_model, grid_canopy)
grid_moisture <- get_predictions(salamander_model, grid_moisture)
grid_litter <- get_predictions(salamander_model, grid_litter)

# Create plots
p1 <- ggplot(grid_canopy, aes(x = canopy_cover, y = prob)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "blue") +
  geom_line(color = "blue", size = 1.2) +
  labs(
    title = "Effect of Canopy Cover",
    x = "Canopy Cover (%)",
    y = "Probability of Presence"
  ) +
  ylim(0, 1) +
  theme_minimal()

p2 <- ggplot(grid_moisture, aes(x = soil_moisture, y = prob)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "darkgreen") +
  geom_line(color = "darkgreen", size = 1.2) +
  labs(
    title = "Effect of Soil Moisture",
    x = "Soil Moisture Index",
    y = "Probability of Presence"
  ) +
  ylim(0, 1) +
  theme_minimal()

p3 <- ggplot(grid_litter, aes(x = leaf_litter, y = prob)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "brown") +
  geom_line(color = "brown", size = 1.2) +
  labs(
    title = "Effect of Leaf Litter Depth",
    x = "Leaf Litter Depth (cm)",
    y = "Probability of Presence"
  ) +
  ylim(0, 1) +
  theme_minimal()

# Arrange
gridExtra::grid.arrange(p1, p2, p3, ncol = 2)

Effects of each predictor on salamander presence probability
The non-linear nature of probability changes

Notice in the plots above that the effect of predictors on probability is non-linear:

  1. When baseline probability is low, increases in predictors have small effects on probability
  2. When baseline probability is moderate (around 0.5), the same increases have large effects
  3. When baseline probability is already high, increases have small effects again

This is why we model on the log-odds scale where effects are linear and consistent. The same one-unit change in a predictor always has the same multiplicative effect on the odds, regardless of baseline probability.

Example: A 10% increase in canopy cover: - From 20% to 30% might increase probability from 0.05 to 0.09 - From 50% to 60% might increase probability from 0.35 to 0.50
- From 80% to 90% might increase probability from 0.80 to 0.85

But in all three cases, the odds ratio is the same!

3.7 Model Diagnostics for Logistic Regression

Unlike linear regression, logistic regression diagnostics are less straightforward. We’ll explore several complementary approaches.

3.7.1 Deviance and Residual Deviance

Deviance in logistic regression is analogous to the residual sum of squares in linear regression. It measures how well the model fits the data.

# Extract deviance information
null_deviance <- salamander_model$null.deviance
residual_deviance <- salamander_model$deviance
df_null <- salamander_model$df.null
df_residual <- salamander_model$df.residual

# Calculate pseudo R-squared (McFadden's R-squared)
pseudo_r2 <- 1 - (residual_deviance / null_deviance)

# Display
cat("Null Deviance:", round(null_deviance, 2), "on", df_null, "degrees of freedom\n")
Null Deviance: 262.5 on 199 degrees of freedom
cat("Residual Deviance:", round(residual_deviance, 2), "on", df_residual, "degrees of freedom\n")
Residual Deviance: 161.78 on 196 degrees of freedom
cat("Reduction in Deviance:", round(null_deviance - residual_deviance, 2), "\n")
Reduction in Deviance: 100.72 
cat("McFadden's Pseudo R²:", round(pseudo_r2, 4), "\n")
McFadden's Pseudo R²: 0.3837 
# Likelihood ratio test
lrtest_stat <- null_deviance - residual_deviance
lrtest_df <- df_null - df_residual
lrtest_p <- pchisq(lrtest_stat, lrtest_df, lower.tail = FALSE)

cat("\nLikelihood Ratio Test:\n")

Likelihood Ratio Test:
cat("Chi-squared =", round(lrtest_stat, 2), "on", lrtest_df, "df, p-value =", 
    format.pval(lrtest_p, digits = 4), "\n")
Chi-squared = 100.72 on 3 df, p-value = < 2.2e-16 
Understanding deviance

Null deviance: Deviance of a model with only an intercept (baseline model)

Residual deviance: Deviance of our fitted model

Reduction in deviance: How much better our model is than the null model

Interpretation: - The significant likelihood ratio test tells us our model is significantly better than the null - The pseudo R² of 0.384 suggests our predictors explain about 38.4% of the deviance - Note: Pseudo R² values are generally lower than R² values in linear regression

3.7.2 Residual Plots

For logistic regression, we use several types of residuals:

# Calculate different types of residuals
salamander_data$pearson_resid <- residuals(salamander_model, type = "pearson")
salamander_data$deviance_resid <- residuals(salamander_model, type = "deviance")
salamander_data$fitted <- fitted(salamander_model)

# Create diagnostic plots
p1 <- ggplot(salamander_data, aes(x = fitted, y = deviance_resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(
    title = "Deviance Residuals vs. Fitted",
    x = "Fitted Probability",
    y = "Deviance Residuals"
  ) +
  theme_minimal()

p2 <- ggplot(salamander_data, aes(sample = deviance_resid)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(
    title = "Q-Q Plot of Deviance Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

p3 <- ggplot(salamander_data, aes(x = seq_along(deviance_resid), y = deviance_resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-3, 3), linetype = "dashed", color = "orange") +
  labs(
    title = "Index Plot of Deviance Residuals",
    x = "Observation Index",
    y = "Deviance Residuals"
  ) +
  theme_minimal()

p4 <- ggplot(salamander_data, aes(x = fitted, y = abs(sqrt(abs(deviance_resid))))) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Scale-Location Plot",
    x = "Fitted Probability",
    y = "√|Deviance Residuals|"
  ) +
  theme_minimal()

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Residual diagnostic plots for logistic regression
Interpreting logistic regression residual plots

Deviance Residuals vs. Fitted: - Look for random scatter around zero - Patterns suggest model misspecification - The smoothed line should be roughly flat

Q-Q Plot: - Assesses whether residuals follow expected distribution - Binary data will show some discreteness, but gross deviations are concerning

Index Plot: - Identifies potential outliers (residuals beyond ±3) - Patterns suggest temporal or spatial structure

Scale-Location: - Checks for constant variance (homoscedasticity) - Ideally, the smoothed line should be roughly flat

3.7.3 Influential Observations

Let’s identify observations that have strong influence on our model:

# Calculate influence measures
influence_measures <- influence.measures(salamander_model)

# Extract Cook's distance and leverage
salamander_data$cooks_d <- cooks.distance(salamander_model)
salamander_data$leverage <- hatvalues(salamander_model)

# Identify influential points
influential_threshold <- 4 / nrow(salamander_data)
salamander_data$influential <- salamander_data$cooks_d > influential_threshold

# Create plots
p1 <- ggplot(salamander_data, aes(x = seq_along(cooks_d), y = cooks_d)) +
  geom_bar(stat = "identity", aes(fill = influential)) +
  geom_hline(yintercept = influential_threshold, linetype = "dashed", color = "red") +
  scale_fill_manual(values = c("FALSE" = "gray70", "TRUE" = "red")) +
  labs(
    title = "Cook's Distance",
    x = "Observation Index",
    y = "Cook's Distance"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(salamander_data, aes(x = leverage, y = deviance_resid)) +
  geom_point(aes(color = influential), alpha = 0.6) +
  scale_color_manual(values = c("FALSE" = "gray70", "TRUE" = "red")) +
  labs(
    title = "Residuals vs. Leverage",
    x = "Leverage",
    y = "Deviance Residuals"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

gridExtra::grid.arrange(p1, p2, ncol = 1)

Diagnostic plots for influential observations
# Print information about influential points
n_influential <- sum(salamander_data$influential)
cat("Number of influential observations:", n_influential, "\n")
Number of influential observations: 14 
if (n_influential > 0) {
  cat("Indices of influential observations:", which(salamander_data$influential), "\n")
}
Indices of influential observations: 6 24 47 53 95 96 103 109 116 124 141 165 183 197 

3.7.4 Checking for Separation

Complete separation occurs when a predictor (or combination of predictors) perfectly predicts the outcome. This causes problems for logistic regression.

# Check for perfect prediction by creating boxplots
p1 <- ggplot(salamander_data, aes(x = presence_factor, y = canopy_cover, fill = presence_factor)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Absent" = "red", "Present" = "blue")) +
  labs(
    title = "Canopy Cover by Presence/Absence",
    x = "Salamander Status",
    y = "Canopy Cover (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(salamander_data, aes(x = presence_factor, y = leaf_litter, fill = presence_factor)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Absent" = "red", "Present" = "blue")) +
  labs(
    title = "Leaf Litter by Presence/Absence",
    x = "Salamander Status",
    y = "Leaf Litter Depth (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Checking for potential separation issues
What is separation and why is it a problem?

Complete separation occurs when: - All presences occur above a certain threshold of a predictor - All absences occur below that threshold - The groups don’t overlap at all

Quasi-complete separation occurs when: - Groups mostly don’t overlap, with only a few exceptions

Problems caused by separation: - Coefficients become unrealistically large - Standard errors become huge - Confidence intervals become infinitely wide - Maximum likelihood estimation fails to converge properly

What to do if you detect separation: 1. Check if it’s real (very strong biological effect) or artificial (too few data points) 2. Consider collapsing categories or binning continuous predictors 3. Use penalized regression (e.g., Firth’s logistic regression) 4. Collect more data in the problematic range

In our salamander example, there’s good overlap in predictor values between presence and absence, so separation is not a concern.

3.8 Model Performance: ROC Curves and AUC

An important way to evaluate logistic regression models is through ROC curves and the Area Under the Curve (AUC).

# Calculate predicted probabilities
salamander_data$predicted_prob <- predict(salamander_model, type = "response")

# Create ROC curve
roc_obj <- roc(salamander_data$presence, salamander_data$predicted_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Get AUC
auc_value <- auc(roc_obj)

# Plot ROC curve
plot(roc_obj, 
     main = "ROC Curve for Salamander Presence Model",
     col = "blue", 
     lwd = 2,
     print.auc = TRUE,
     auc.polygon = TRUE,
     auc.polygon.col = "lightblue",
     grid = TRUE)

# Add diagonal reference line
abline(a = 0, b = 1, lty = 2, col = "gray")

ROC curve for salamander presence model
# Find optimal threshold using Youden's index
coords_roc <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
cat("\nOptimal threshold (Youden's Index):", round(coords_roc$threshold, 3), "\n")

Optimal threshold (Youden's Index): 0.723 
cat("Sensitivity at optimal threshold:", round(coords_roc$sensitivity, 3), "\n")
Sensitivity at optimal threshold: 0.709 
cat("Specificity at optimal threshold:", round(coords_roc$specificity, 3), "\n")
Specificity at optimal threshold: 0.918 
Understanding ROC curves and AUC

ROC Curve (Receiver Operating Characteristic): - Plots True Positive Rate (Sensitivity) vs. False Positive Rate (1 - Specificity) - Shows the trade-off between correctly identifying presences and incorrectly calling absences as presences - Each point represents a different probability threshold for classification

AUC (Area Under the Curve): - Ranges from 0 to 1 - 0.5 = random guessing (diagonal line) - 1.0 = perfect discrimination - General guidelines: - 0.90-1.00: Excellent - 0.80-0.90: Good - 0.70-0.80: Fair - 0.60-0.70: Poor - 0.50-0.60: Fail

Our model’s AUC = 0.885, which indicates good discriminatory ability.

Optimal threshold: The point on the ROC curve that maximizes sensitivity + specificity. For presence/absence studies, you might choose different thresholds based on the costs of false positives vs. false negatives.

3.8.1 Classification Tables and Prediction Accuracy

Let’s examine how well our model classifies observations at different thresholds:

# Function to create classification table
create_class_table <- function(observed, predicted_prob, threshold) {
  predicted_class <- ifelse(predicted_prob >= threshold, 1, 0)
  table(Observed = observed, Predicted = predicted_class)
}

# Try different thresholds
thresholds <- c(0.3, 0.5, 0.7)

for (thresh in thresholds) {
  cat("\n=== Threshold =", thresh, "===\n")
  class_tab <- create_class_table(salamander_data$presence, 
                                  salamander_data$predicted_prob, 
                                  thresh)
  print(class_tab)
  
  # Calculate metrics
  tn <- class_tab[1, 1]  # True negatives
  fp <- class_tab[1, 2]  # False positives
  fn <- class_tab[2, 1]  # False negatives
  tp <- class_tab[2, 2]  # True positives
  
  accuracy <- (tp + tn) / sum(class_tab)
  sensitivity <- tp / (tp + fn)  # True positive rate
  specificity <- tn / (tn + fp)  # True negative rate
  precision <- tp / (tp + fp)     # Positive predictive value
  
  cat("Accuracy:", round(accuracy, 3), "\n")
  cat("Sensitivity (True Positive Rate):", round(sensitivity, 3), "\n")
  cat("Specificity (True Negative Rate):", round(specificity, 3), "\n")
  cat("Precision (Positive Predictive Value):", round(precision, 3), "\n")
}

=== Threshold = 0.3 ===
        Predicted
Observed   0   1
       0  36  37
       1   6 121
Accuracy: 0.785 
Sensitivity (True Positive Rate): 0.953 
Specificity (True Negative Rate): 0.493 
Precision (Positive Predictive Value): 0.766 

=== Threshold = 0.5 ===
        Predicted
Observed   0   1
       0  50  23
       1  17 110
Accuracy: 0.8 
Sensitivity (True Positive Rate): 0.866 
Specificity (True Negative Rate): 0.685 
Precision (Positive Predictive Value): 0.827 

=== Threshold = 0.7 ===
        Predicted
Observed  0  1
       0 63 10
       1 33 94
Accuracy: 0.785 
Sensitivity (True Positive Rate): 0.74 
Specificity (True Negative Rate): 0.863 
Precision (Positive Predictive Value): 0.904 
Choosing a classification threshold

The choice of threshold depends on your research goals:

Use a lower threshold (e.g., 0.3) when: - False negatives are costly (missing true presences is bad) - You want high sensitivity (catching most presences) - Example: Conservation - you want to identify all potential habitat

Use a higher threshold (e.g., 0.7) when: - False positives are costly (falsely claiming presence is bad) - You want high specificity (being certain about presences) - Example: Rare species - you need strong evidence before declaring presence

Use the default 0.5 threshold when: - False positives and false negatives are equally costly - You want a balanced approach

For ecological applications, consider the biological and management implications when choosing your threshold.

3.9 Interaction Effects in Logistic Regression

Just like in linear models, predictors can interact in logistic regression. Let’s explore whether the effect of canopy cover depends on soil moisture:

# Fit model with interaction
interaction_model <- glm(presence ~ canopy_cover * soil_moisture + leaf_litter,
                        family = binomial(link = "logit"),
                        data = salamander_data)

# Compare models
anova(salamander_model, interaction_model, test = "Chisq")
Analysis of Deviance Table

Model 1: presence ~ canopy_cover + soil_moisture + leaf_litter
Model 2: presence ~ canopy_cover * soil_moisture + leaf_litter
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       196     161.78                     
2       195     160.67  1   1.1058    0.293
# Summary of interaction model
summary(interaction_model)

Call:
glm(formula = presence ~ canopy_cover * soil_moisture + leaf_litter, 
    family = binomial(link = "logit"), data = salamander_data)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -8.6030300  1.8372198  -4.683 2.83e-06 ***
canopy_cover                0.1043973  0.0352097   2.965  0.00303 ** 
soil_moisture               0.0837018  0.0297476   2.814  0.00490 ** 
leaf_litter                 0.1934453  0.0414043   4.672 2.98e-06 ***
canopy_cover:soil_moisture -0.0006986  0.0006628  -1.054  0.29191    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 262.50  on 199  degrees of freedom
Residual deviance: 160.67  on 195  degrees of freedom
AIC: 170.67

Number of Fisher Scoring iterations: 6

Let’s visualize the interaction:

# Create prediction grid with three levels of soil moisture
moisture_levels <- quantile(salamander_data$soil_moisture, c(0.25, 0.5, 0.75))

interaction_grid <- expand.grid(
  canopy_cover = seq(0, 100, length.out = 100),
  soil_moisture = moisture_levels,
  leaf_litter = mean(salamander_data$leaf_litter)
)

# Get predictions
interaction_grid$probability <- predict(interaction_model, 
                                       newdata = interaction_grid, 
                                       type = "response")

# Add factor for plotting
interaction_grid$moisture_cat <- factor(
  interaction_grid$soil_moisture,
  levels = moisture_levels,
  labels = c("Low Moisture (25th %ile)", 
            "Medium Moisture (Median)", 
            "High Moisture (75th %ile)")
)

# Plot
ggplot(interaction_grid, aes(x = canopy_cover, y = probability, 
                             color = moisture_cat, linetype = moisture_cat)) +
  geom_line(size = 1.2) +
  scale_color_viridis_d(option = "plasma", end = 0.9) +
  labs(
    title = "Interaction Between Canopy Cover and Soil Moisture",
    subtitle = "Effect of canopy depends on soil moisture level",
    x = "Canopy Cover (%)",
    y = "Probability of Salamander Presence",
    color = NULL,
    linetype = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "top")

Interaction between canopy cover and soil moisture
Interpreting interactions in logistic regression

When an interaction is present, the effect of one predictor depends on the level of another predictor.

Looking at our plot: - At low soil moisture, increasing canopy cover has a modest effect - At medium soil moisture, increasing canopy cover has a stronger effect
- At high soil moisture, the effect is strongest

Interpretation on the odds scale: The interaction coefficient tells us how the odds ratio for canopy cover changes per unit increase in soil moisture.

Statistical significance: The ANOVA likelihood ratio test tells us whether the interaction significantly improves model fit. In this case, the p-value = 0.293.

Interactions are common in ecology - environmental factors often work synergistically. However, only include interactions if they’re biologically meaningful and improve model fit.

3.10 Model Selection with Logistic Regression

Just as with linear models, we can compare competing logistic regression models using information criteria (AIC, BIC) and hypothesis tests.

# Fit several candidate models
model_full <- glm(presence ~ canopy_cover + soil_moisture + leaf_litter,
                 family = binomial, data = salamander_data)

model_no_litter <- glm(presence ~ canopy_cover + soil_moisture,
                      family = binomial, data = salamander_data)

model_no_moisture <- glm(presence ~ canopy_cover + leaf_litter,
                        family = binomial, data = salamander_data)

model_canopy_only <- glm(presence ~ canopy_cover,
                        family = binomial, data = salamander_data)

model_null <- glm(presence ~ 1,
                 family = binomial, data = salamander_data)

# Create model list
models <- list(
  "Full Model" = model_full,
  "No Litter" = model_no_litter,
  "No Moisture" = model_no_moisture,
  "Canopy Only" = model_canopy_only,
  "Null (Intercept Only)" = model_null
)

# Calculate AIC for all models
aic_table <- data.frame(
  Model = names(models),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC),
  Deviance = sapply(models, deviance),
  df = sapply(models, function(x) x$df.residual)
)

# Calculate delta AIC and weights
aic_table$Delta_AIC <- aic_table$AIC - min(aic_table$AIC)
aic_table$AIC_weight <- exp(-0.5 * aic_table$Delta_AIC) / 
                        sum(exp(-0.5 * aic_table$Delta_AIC))

# Sort by AIC
aic_table <- aic_table[order(aic_table$AIC), ]
aic_table
                                      Model      AIC      BIC Deviance  df
Full Model                       Full Model 169.7767 182.9699 161.7767 196
No Moisture                     No Moisture 184.7288 194.6237 178.7288 197
No Litter                         No Litter 196.2945 206.1894 190.2945 197
Canopy Only                     Canopy Only 211.4985 218.0952 207.4985 198
Null (Intercept Only) Null (Intercept Only) 264.4963 267.7947 262.4963 199
                      Delta_AIC   AIC_weight
Full Model              0.00000 9.994321e-01
No Moisture            14.95210 5.661686e-04
No Litter              26.51782 1.743744e-06
Canopy Only            41.72185 8.709026e-10
Null (Intercept Only)  94.71968 2.701700e-21
Model selection criteria for logistic regression

AIC (Akaike Information Criterion): - Lower is better - Models within 2 AIC units have substantial support - Balances fit and complexity

BIC (Bayesian Information Criterion): - Lower is better
- Penalizes complexity more strongly than AIC - Preferred when sample size is large

AIC Weights: - Can be interpreted as the probability that each model is the “best” model - Useful for model averaging

Deviance: - Lower is better - Raw measure of model fit without complexity penalty

Interpretation: The full model has the lowest AIC (169.78), with an AIC weight of 0.999, suggesting it’s clearly the best model among those tested. All three predictors contribute meaningfully to explaining salamander presence.

3.10.1 Likelihood Ratio Tests for Nested Models

We can use likelihood ratio tests to formally compare nested models:

# Test if leaf litter is needed (comparing full vs. no litter)
lrt_litter <- anova(model_no_litter, model_full, test = "Chisq")
cat("=== Testing leaf_litter ===\n")
=== Testing leaf_litter ===
print(lrt_litter)
Analysis of Deviance Table

Model 1: presence ~ canopy_cover + soil_moisture
Model 2: presence ~ canopy_cover + soil_moisture + leaf_litter
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       197     190.29                          
2       196     161.78  1   28.518 9.284e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test if soil moisture is needed (comparing full vs. no moisture)
lrt_moisture <- anova(model_no_moisture, model_full, test = "Chisq")
cat("\n=== Testing soil_moisture ===\n")

=== Testing soil_moisture ===
print(lrt_moisture)
Analysis of Deviance Table

Model 1: presence ~ canopy_cover + leaf_litter
Model 2: presence ~ canopy_cover + soil_moisture + leaf_litter
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       197     178.73                          
2       196     161.78  1   16.952 3.833e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test if all predictors together are needed (comparing null vs. full)
lrt_all <- anova(model_null, model_full, test = "Chisq")
cat("\n=== Testing all predictors together ===\n")

=== Testing all predictors together ===
print(lrt_all)
Analysis of Deviance Table

Model 1: presence ~ 1
Model 2: presence ~ canopy_cover + soil_moisture + leaf_litter
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       199     262.50                          
2       196     161.78  3   100.72 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Likelihood ratio tests vs. Wald tests

Wald tests (shown in summary() output): - Test individual coefficients - Assume asymptotic normality - Can be unreliable for small samples

Likelihood ratio tests (shown above): - Compare nested models - More reliable than Wald tests - Preferred for hypothesis testing in logistic regression - The test statistic follows a chi-squared distribution

Both tests should generally agree, but when they differ, trust the likelihood ratio test.

3.11 Sample Size and Power Considerations

Logistic regression requires adequate sample size, particularly in each outcome category.

# Check sample sizes
n_absent <- sum(salamander_data$presence == 0)
n_present <- sum(salamander_data$presence == 1)

cat("Sample sizes:\n")
Sample sizes:
cat("  Absent:", n_absent, "(", round(n_absent/nrow(salamander_data)*100, 1), "%)\n")
  Absent: 73 ( 36.5 %)
cat("  Present:", n_present, "(", round(n_present/nrow(salamander_data)*100, 1), "%)\n")
  Present: 127 ( 63.5 %)
# Events per variable
n_predictors <- length(coef(salamander_model)) - 1  # Exclude intercept
events_per_var <- min(n_absent, n_present) / n_predictors

cat("\nEvents per predictor variable:", round(events_per_var, 1), "\n")

Events per predictor variable: 24.3 
Sample size guidelines for logistic regression

General rules of thumb:

  1. Minimum total sample size: At least 100 observations for stable estimates

  2. Events per variable (EPV): At least 10-20 events (in the rarer outcome category) per predictor variable

    • EPV < 5: High risk of overfitting and bias
    • EPV < 10: Moderate risk
    • EPV ≥ 10: Generally acceptable
    • EPV ≥ 20: Preferred for reliable estimates
  3. Balance: Ideally, don’t have extreme imbalance (e.g., 95% vs. 5%)

    • If severely imbalanced, consider:
      • Collecting more data in the rare category
      • Case-control sampling
      • Penalized regression methods
      • Rare-event logistic regression

Our salamander model has 24.3 events per variable, which is excellent.

3.12 Common Pitfalls and How to Avoid Them

Let’s discuss frequent mistakes students make with logistic regression:

Common mistakes in logistic regression

1. Interpreting coefficients as probabilities - ❌ Wrong: “A one-unit increase in X increases presence probability by β” - ✅ Right: “A one-unit increase in X multiplies the odds by exp(β)”

2. Ignoring the scale of interpretation - Coefficients are on the log-odds scale - Exponentiated coefficients are odds ratios - Effects on probability depend on baseline probability

3. Using the wrong family in glm() - Remember: family = binomial(link = "logit") - Common error: forgetting the family argument (defaults to gaussian)

4. Assuming linearity without checking - The relationship between predictors and log-odds should be linear - Check with splines or polynomial terms if needed

5. Overfitting with too many predictors - Respect the events-per-variable rule - Use model selection carefully

6. Treating predicted probabilities as data - Don’t use predict() output as if it were measured data - Predictions have uncertainty (use confidence intervals)

7. Confusing odds ratios with relative risk - They’re only approximately equal when the outcome is rare - Don’t present odds ratios as if they were relative risks

8. Ignoring complete separation - Check for perfect prediction - Use penalized methods if separation is detected

3.13 A Complete Worked Example: Butterfly Habitat Selection

Let’s work through a complete analysis from start to finish, demonstrating all the concepts we’ve learned.

The Dataset

We’ll analyze survey data on butterfly presence across different habitat types. Researchers conducted visual surveys at 250 sites across meadows, forest edges, and woodlands to understand habitat selection by a focal butterfly species.

Dataset: butterfly_presence.csv

Variables: - presence: Butterfly observed (1) or not observed (0) during survey - temperature: Average temperature during survey period (°C) - flower_density: Number of flowering plants per square meter - habitat_type: Type of habitat (meadow, forest edge, woodland)

Study context: This butterfly species is a generalist pollinator that may show preferences for specific habitat conditions. Understanding these preferences helps inform habitat management for pollinator conservation.

Loading and Exploring the Data

# Load butterfly survey data
butterfly_data <- read.csv("data/butterfly_presence.csv")

# Structure
str(butterfly_data)
'data.frame':   250 obs. of  4 variables:
 $ presence      : int  1 0 0 0 0 0 1 0 0 0 ...
 $ temperature   : num  18.3 28.1 29 18.1 21.4 ...
 $ flower_density: int  14 15 15 10 9 10 12 16 19 12 ...
 $ habitat_type  : chr  "forest edge" "meadow" "forest edge" "meadow" ...
# Summary
summary(butterfly_data)
    presence     temperature    flower_density  habitat_type      
 Min.   :0.00   Min.   :13.39   Min.   : 6.00   Length:250        
 1st Qu.:0.00   1st Qu.:21.56   1st Qu.:13.00   Class :character  
 Median :0.00   Median :25.02   Median :15.00   Mode  :character  
 Mean   :0.26   Mean   :25.16   Mean   :15.29                     
 3rd Qu.:1.00   3rd Qu.:29.04   3rd Qu.:18.00                     
 Max.   :1.00   Max.   :36.40   Max.   :26.00                     
# Convert habitat_type to factor with meaningful level order
butterfly_data$habitat_type <- factor(butterfly_data$habitat_type, 
                                     levels = c("woodland", "forest edge", "meadow"))

# Detection rates
cat("\nButterfly detections:\n")

Butterfly detections:
table(butterfly_data$presence)

  0   1 
185  65 
cat("\nDetection rate:", round(mean(butterfly_data$presence) * 100, 1), "%\n")

Detection rate: 26 %
# Detections by habitat
cat("\nDetections by habitat type:\n")

Detections by habitat type:
table(butterfly_data$presence, butterfly_data$habitat_type)
   
    woodland forest edge meadow
  0       45          79     61
  1        8          20     37

Visual Data Exploration

# Create exploratory plots
p1 <- ggplot(butterfly_data, aes(x = factor(presence), y = temperature, fill = factor(presence))) +
  geom_boxplot() +
  scale_fill_manual(values = c("0" = "lightcoral", "1" = "lightblue")) +
  labs(title = "Temperature by Presence", x = "Presence", y = "Temperature (°C)") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(butterfly_data, aes(x = factor(presence), y = flower_density, fill = factor(presence))) +
  geom_boxplot() +
  scale_fill_manual(values = c("0" = "lightcoral", "1" = "lightblue")) +
  labs(title = "Flower Density by Presence", x = "Presence", y = "Flowers per m²") +
  theme_minimal() +
  theme(legend.position = "none")

p3 <- ggplot(butterfly_data, aes(x = habitat_type, fill = factor(presence))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("0" = "lightcoral", "1" = "lightblue"), 
                   labels = c("Absent", "Present")) +
  labs(title = "Presence by Habitat Type", x = "Habitat Type", y = "Proportion", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "top")

p4 <- ggplot(butterfly_data, aes(x = temperature, y = flower_density, color = factor(presence))) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("0" = "red", "1" = "blue"), 
                    labels = c("Absent", "Present")) +
  labs(title = "Temperature vs. Flower Density", 
       x = "Temperature (°C)", y = "Flowers per m²", color = NULL) +
  theme_minimal() +
  theme(legend.position = "top")

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)

Exploratory analysis of butterfly presence

Fit Candidate Models

# Model 1: Linear effects only
butterfly_m1 <- glm(presence ~ temperature + flower_density + habitat_type,
                   family = binomial, data = butterfly_data)

# Model 2: Add quadratic temperature effect (based on biological expectation)
butterfly_m2 <- glm(presence ~ temperature + I(temperature^2) + flower_density + habitat_type,
                   family = binomial, data = butterfly_data)

# Model 3: Add interaction between flower density and habitat
butterfly_m3 <- glm(presence ~ temperature + I(temperature^2) + 
                              flower_density * habitat_type,
                   family = binomial, data = butterfly_data)

# Compare models
AIC(butterfly_m1, butterfly_m2, butterfly_m3)
             df      AIC
butterfly_m1  5 205.5403
butterfly_m2  6 206.8971
butterfly_m3  8 208.8934

Evaluate Best Model

# Select best model (Model 2 based on AIC)
best_butterfly_model <- butterfly_m2

# Summary
summary(best_butterfly_model)

Call:
glm(formula = presence ~ temperature + I(temperature^2) + flower_density + 
    habitat_type, family = binomial, data = butterfly_data)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)   
(Intercept)             -0.062053   5.510731  -0.011  0.99102   
temperature              0.046378   0.483334   0.096  0.92356   
I(temperature^2)        -0.008336   0.010572  -0.788  0.43041   
flower_density           0.132692   0.051084   2.598  0.00939 **
habitat_typeforest edge  0.236289   0.532269   0.444  0.65709   
habitat_typemeadow       1.211264   0.522810   2.317  0.02051 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 286.53  on 249  degrees of freedom
Residual deviance: 194.90  on 244  degrees of freedom
AIC: 206.9

Number of Fisher Scoring iterations: 6
# Calculate odds ratios with CIs
or_ci <- exp(cbind(OR = coef(best_butterfly_model), confint(best_butterfly_model)))
Waiting for profiling to be done...
round(or_ci, 3)
                           OR 2.5 %    97.5 %
(Intercept)             0.940 0.000 49427.327
temperature             1.047 0.411     2.746
I(temperature^2)        0.992 0.970     1.012
flower_density          1.142 1.036     1.266
habitat_typeforest edge 1.267 0.455     3.733
habitat_typemeadow      3.358 1.247     9.843
# Model diagnostics
cat("\n=== Model Diagnostics ===\n")

=== Model Diagnostics ===
cat("AIC:", AIC(best_butterfly_model), "\n")
AIC: 206.8971 
cat("McFadden's Pseudo R²:", 1 - (best_butterfly_model$deviance / best_butterfly_model$null.deviance), "\n")
McFadden's Pseudo R²: 0.3197984 
# ROC analysis
butterfly_roc <- roc(butterfly_data$presence, fitted(best_butterfly_model))
Setting levels: control = 0, case = 1
Setting direction: controls < cases
cat("AUC:", round(auc(butterfly_roc), 3), "\n")
AUC: 0.865 

Visualize Effects

# Create prediction grids
temp_grid <- data.frame(
  temperature = seq(min(butterfly_data$temperature), 
                   max(butterfly_data$temperature), 
                   length.out = 100),
  flower_density = mean(butterfly_data$flower_density),
  habitat_type = "meadow"
)

flower_grid <- data.frame(
  temperature = mean(butterfly_data$temperature),
  flower_density = seq(min(butterfly_data$flower_density),
                      max(butterfly_data$flower_density),
                      length.out = 100),
  habitat_type = "meadow"
)

habitat_grid <- data.frame(
  temperature = mean(butterfly_data$temperature),
  flower_density = mean(butterfly_data$flower_density),
  habitat_type = levels(butterfly_data$habitat_type)
)

# Get predictions
temp_grid$prob <- predict(best_butterfly_model, newdata = temp_grid, type = "response")
flower_grid$prob <- predict(best_butterfly_model, newdata = flower_grid, type = "response")
habitat_grid$prob <- predict(best_butterfly_model, newdata = habitat_grid, type = "response")

# Create plots
p1 <- ggplot(temp_grid, aes(x = temperature, y = prob)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Effect of Temperature (Quadratic)",
       x = "Temperature (°C)", y = "P(Presence)") +
  ylim(0, 1) +
  theme_minimal()

p2 <- ggplot(flower_grid, aes(x = flower_density, y = prob)) +
  geom_line(color = "darkgreen", size = 1.2) +
  labs(title = "Effect of Flower Density",
       x = "Flowers per m²", y = "P(Presence)") +
  ylim(0, 1) +
  theme_minimal()

p3 <- ggplot(habitat_grid, aes(x = habitat_type, y = prob, fill = habitat_type)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_d(option = "plasma", end = 0.9) +
  labs(title = "Effect of Habitat Type",
       x = "Habitat Type", y = "P(Presence)") +
  ylim(0, 1) +
  theme_minimal() +
  theme(legend.position = "none")

gridExtra::grid.arrange(p1, p2, p3, ncol = 2)

Predicted effects from the butterfly presence model
Ecological interpretation of the butterfly model

Our analysis reveals several important patterns:

1. Temperature effect (quadratic): - Butterflies show a preference for intermediate temperatures around 25°C - The quadratic term (OR = 0.9917 < 1) indicates that very high or low temperatures reduce presence probability - This makes biological sense - butterflies are ectotherms with thermal optima

2. Flower density effect (linear): - Each additional flower per m² increases odds of presence by 14.2% - This positive relationship reflects the importance of nectar resources

3. Habitat type effect: - Meadows have the highest probability of butterfly presence (reference category) - Forest edges have lower probability (OR = 1.267) - Woodlands have the lowest probability (OR = NA) - This aligns with the ecology of sun-loving butterflies

Model performance: - AUC = 0.865 indicates good to excellent discrimination - The model successfully captures the key factors driving butterfly presence

3.14 Extension: Multinomial Logistic Regression Preview

In this chapter, we’ve focused on binary logistic regression (two outcomes). But what if you have more than two categories?

We’ll explore multinomial logistic regression in detail in a later chapter. For now, just be aware that the concepts we’ve learned extend to multi-category outcomes.

Examples of multinomial responses: - Habitat selection among 3+ habitat types - Choosing among multiple prey species - Disease status (healthy, mild infection, severe infection) - Behavioral states (foraging, resting, vigilant, traveling)

Key differences: - Multiple logit equations (one for each category vs. a reference) - More coefficients to interpret - Same link function (logit) and odds ratio interpretation

For now, remember that logistic regression provides the foundation for modeling categorical outcomes of any kind.

3.15 Connecting to Future Topics

The concepts you’ve learned in this chapter will be essential as we move forward:

Links to upcoming topics

Count data models (Poisson, Negative Binomial): - Also use link functions (log link instead of logit) - Also part of the GLM family - Similar interpretation challenges (coefficients vs. exponentiated coefficients)

Zero-inflated and Hurdle models: - Combine binary models (like logistic regression) with count models - Use logistic regression for the zero/non-zero part

Beta regression: - For proportion data bounded between 0 and 1 - Uses logit link like logistic regression - But response is continuous, not binary

Mixed models: - Can incorporate logistic regression with random effects - Handles non-independence in binary data

The logit link, odds ratios, and maximum likelihood estimation you’ve learned here will appear again and again!

3.16 Exercise: Analyzing Tree Mortality Data

Now it’s your turn! You’ll analyze a dataset on tree mortality following a drought event. The goal is to understand which factors predict whether individual trees survived or died.

# Simulate tree mortality data
set.seed(789)
n <- 300

tree_data <- data.frame(
  dbh = runif(n, 10, 80),                    # Diameter at breast height (cm)
  canopy_position = sample(c("suppressed", "intermediate", "dominant"), 
                           n, replace = TRUE, prob = c(0.3, 0.4, 0.3)),
  soil_depth = rnorm(n, 50, 15),            # Soil depth (cm)
  drought_severity = runif(n, 0, 10)        # Local drought severity index
)

# Generate mortality (1 = died, 0 = survived)
# Larger trees, dominant position, deeper soil, and less severe drought improve survival
logit_survival <- 2 - 0.04 * tree_data$dbh + 
                  ifelse(tree_data$canopy_position == "dominant", 1.5,
                        ifelse(tree_data$canopy_position == "intermediate", 0.5, 0)) +
                  0.03 * tree_data$soil_depth - 
                  0.4 * tree_data$drought_severity

prob_mortality <- 1 - plogis(logit_survival)  # Convert to mortality probability
tree_data$mortality <- rbinom(n, 1, prob_mortality)

# Save as CSV
write.csv(tree_data, "data/tree_mortality.csv", row.names = FALSE)

# Your tasks:
# 1. Load and explore the data
# [YOUR CODE HERE]

# 2. Fit a logistic regression model
# [YOUR CODE HERE]

# 3. Interpret coefficients as odds ratios
# [YOUR CODE HERE]

# 4. Check model diagnostics
# [YOUR CODE HERE]

# 5. Create an ROC curve and calculate AUC
# [YOUR CODE HERE]

# 6. Visualize the effects of predictors on mortality probability
# [YOUR CODE HERE]

# 7. Compare different candidate models using AIC
# [YOUR CODE HERE]

Start by examining the distribution of mortality across levels of your categorical predictor (canopy position). Create both summary tables and visualizations.

When interpreting odds ratios, remember: - OR > 1 means the predictor increases the odds of mortality - OR < 1 means the predictor decreases the odds of mortality
- For a decrease, it’s often clearer to say “reduces odds by X%” where X = (1 - OR) × 100

Consider whether any predictors might have non-linear effects (e.g., quadratic terms) or interactions. Test these using likelihood ratio tests.

Solution.

# Load the data
tree_data <- read.csv("data/tree_mortality.csv")

# 1. Explore the data
summary(tree_data)
str(tree_data)

# Convert canopy position to factor with appropriate reference level
tree_data$canopy_position <- factor(tree_data$canopy_position,
                                   levels = c("suppressed", "intermediate", "dominant"))

# Visualize relationships
library(ggplot2)
library(gridExtra)

p1 <- ggplot(tree_data, aes(x = factor(mortality), y = dbh, fill = factor(mortality))) +
  geom_boxplot() +
  labs(title = "DBH by Mortality", x = "Mortality (0=Alive, 1=Dead)", y = "DBH (cm)") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(tree_data, aes(x = canopy_position, fill = factor(mortality))) +
  geom_bar(position = "fill") +
  labs(title = "Mortality by Canopy Position", x = NULL, y = "Proportion") +
  theme_minimal()

p3 <- ggplot(tree_data, aes(x = factor(mortality), y = soil_depth, fill = factor(mortality))) +
  geom_boxplot() +
  labs(title = "Soil Depth by Mortality", x = "Mortality", y = "Soil Depth (cm)") +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(tree_data, aes(x = drought_severity, fill = factor(mortality))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 20) +
  labs(title = "Drought Severity Distribution", x = "Drought Severity", y = "Count") +
  theme_minimal()

grid.arrange(p1, p2, p3, p4, ncol = 2)

# 2. Fit logistic regression model
mortality_model <- glm(mortality ~ dbh + canopy_position + soil_depth + drought_severity,
                      family = binomial(link = "logit"),
                      data = tree_data)

summary(mortality_model)

# 3. Interpret coefficients as odds ratios
library(broom)

# Get odds ratios with confidence intervals
or_table <- exp(cbind(OR = coef(mortality_model), 
                     confint(mortality_model)))
round(or_table, 3)

# Interpretation
cat("\n=== Interpretation ===\n")
cat("DBH: For each 1-cm increase in diameter, odds of mortality change by factor of",
    round(or_table["dbh", "OR"], 3), "\n")
cat("  This is a", round((1 - or_table["dbh", "OR"]) * 100, 1), 
    "% reduction in odds per cm (larger trees survive better)\n\n")

cat("Intermediate vs. Suppressed: OR =", round(or_table["canopy_positionintermediate", "OR"], 3), "\n")
cat("Dominant vs. Suppressed: OR =", round(or_table["canopy_positiondominant", "OR"], 3), "\n")
cat("  Dominant trees have", round((1 - or_table["canopy_positiondominant", "OR"]) * 100, 1),
    "% lower odds of mortality\n\n")

cat("Soil Depth: OR =", round(or_table["soil_depth", "OR"], 3), "\n")
cat("Drought Severity: OR =", round(or_table["drought_severity", "OR"], 3), "\n")

# 4. Model diagnostics
library(DHARMa)

# Deviance and pseudo R-squared
cat("\n=== Model Fit ===\n")
cat("Null Deviance:", mortality_model$null.deviance, "\n")
cat("Residual Deviance:", mortality_model$deviance, "\n")
pseudo_r2 <- 1 - (mortality_model$deviance / mortality_model$null.deviance)
cat("McFadden's Pseudo R²:", round(pseudo_r2, 3), "\n")

# Residual plots
par(mfrow = c(2, 2))
plot(mortality_model)
par(mfrow = c(1, 1))

# Check for influential observations
tree_data$cooks_d <- cooks.distance(mortality_model)
influential_points <- which(tree_data$cooks_d > 4/nrow(tree_data))
cat("\nNumber of influential observations:", length(influential_points), "\n")

# 5. ROC curve and AUC
library(pROC)

mortality_roc <- roc(tree_data$mortality, fitted(mortality_model))
plot(mortality_roc, main = "ROC Curve for Tree Mortality Model",
     col = "blue", lwd = 2, print.auc = TRUE)
abline(a = 0, b = 1, lty = 2, col = "gray")

cat("\nAUC:", round(auc(mortality_roc), 3), "\n")

# 6. Visualize effects
# Effect of DBH
dbh_grid <- data.frame(
  dbh = seq(min(tree_data$dbh), max(tree_data$dbh), length.out = 100),
  canopy_position = "intermediate",
  soil_depth = mean(tree_data$soil_depth),
  drought_severity = mean(tree_data$drought_severity)
)
dbh_grid$prob_mortality <- predict(mortality_model, newdata = dbh_grid, type = "response")

# Effect of drought severity  
drought_grid <- data.frame(
  dbh = mean(tree_data$dbh),
  canopy_position = "intermediate",
  soil_depth = mean(tree_data$soil_depth),
  drought_severity = seq(min(tree_data$drought_severity), 
                         max(tree_data$drought_severity), 
                         length.out = 100)
)
(cbind(OR = coef(mortality_model), 
                     confint(mortality_model)))

cat("\n=== Odds Ratios with 95% Confidence Intervals ===\n")
print(round(or_ci, 3))

# Detailed interpretation
cat("\n=== Coefficient Interpretations ===\n")
cat("\nDBH: OR =", round(exp(coef(mortality_model)["dbh"]), 3))
cat("\n  Interpretation: For each 1-cm increase in DBH, odds of mortality")
cat("\n  are multiplied by", round(exp(coef(mortality_model)["dbh"]), 3))
cat("\n  This is a", round((1 - exp(coef(mortality_model)["dbh"])) * 100, 1), 
    "% REDUCTION in odds")
cat("\n  (Larger trees are more resistant to drought)\n")

cat("\nCanopy Position - Intermediate vs. Suppressed:")
cat("\n  OR =", round(exp(coef(mortality_model)["canopy_positionintermediate"]), 3))
cat("\n  Reduction in mortality odds:", 
    round((1 - exp(coef(mortality_model)["canopy_positionintermediate"])) * 100, 1), "%\n")

cat("\nCanopy Position - Dominant vs. Suppressed:")
cat("\n  OR =", round(exp(coef(mortality_model)["canopy_positiondominant"]), 3))
cat("\n  Reduction in mortality odds:", 
    round((1 - exp(coef(mortality_model)["canopy_positiondominant"])) * 100, 1), "%")
cat("\n  (Dominant trees have much better survival)\n")

cat("\nSoil Depth: OR =", round(exp(coef(mortality_model)["soil_depth"]), 3))
cat("\n  Each additional cm reduces mortality odds by",
    round((1 - exp(coef(mortality_model)["soil_depth"])) * 100, 1), "%\n")

cat("\nDrought Severity: OR =", round(exp(coef(mortality_model)["drought_severity"]), 3))
cat("\n  Each unit increase in drought severity increases mortality odds by",
    round((exp(coef(mortality_model)["drought_severity"]) - 1) * 100, 1), "%")
cat("\n  (Strong effect - drought is a major mortality driver)\n")

# 4. Model diagnostics
cat("\n=== Model Fit Statistics ===\n")
cat("Null Deviance:", mortality_model$null.deviance, "\n")
cat("Residual Deviance:", mortality_model$deviance, "\n")
pseudo_r2 <- 1 - (mortality_model$deviance / mortality_model$null.deviance)
cat("McFadden's Pseudo R²:", round(pseudo_r2, 3), "\n")

# Residual plots
par(mfrow = c(2, 2))
plot(mortality_model)
par(mfrow = c(1, 1))

# Check for influential observations
tree_data$cooks_d <- cooks.distance(mortality_model)
influential_threshold <- 4/nrow(tree_data)
n_influential <- sum(tree_data$cooks_d > influential_threshold)
cat("\nNumber of influential observations:", n_influential, "\n")

# 5. ROC curve and AUC
mortality_roc <- roc(tree_data$mortality, fitted(mortality_model))
plot(mortality_roc, main = "ROC Curve for Tree Mortality Model",
     col = "blue", lwd = 2, print.auc = TRUE)
abline(a = 0, b = 1, lty = 2, col = "gray")

cat("\nAUC:", round(auc(mortality_roc), 3), "\n")
cat("Interpretation:", 
    ifelse(auc(mortality_roc) >= 0.9, "Excellent", 
           ifelse(auc(mortality_roc) >= 0.8, "Good", "Fair")),
    "discriminatory ability\n")

# 6. Visualize effects
# Effect of DBH
dbh_grid <- data.frame(
  dbh = seq(min(tree_data$dbh), max(tree_data$dbh), length.out = 100),
  canopy_position = "intermediate",
  soil_depth = mean(tree_data$soil_depth),
  drought_severity = mean(tree_data$drought_severity)
)
dbh_grid$prob_mortality <- predict(mortality_model, newdata = dbh_grid, type = "response")

# Effect of drought severity  
drought_grid <- data.frame(
  dbh = mean(tree_data$dbh),
  canopy_position = "intermediate",
  soil_depth = mean(tree_data$soil_depth),
  drought_severity = seq(min(tree_data$drought_severity), 
                         max(tree_data$drought_severity), 
                         length.out = 100)
)
drought_grid$prob_mortality <- predict(mortality_model, newdata = drought_grid, 
                                       type = "response")

# Effect of canopy position
canopy_grid <- data.frame(
  dbh = mean(tree_data$dbh),
  canopy_position = levels(tree_data$canopy_position),
  soil_depth = mean(tree_data$soil_depth),
  drought_severity = mean(tree_data$drought_severity)
)
canopy_grid$prob_mortality <- predict(mortality_model, newdata = canopy_grid, 
                                      type = "response")

# Plot effects
p1 <- ggplot(dbh_grid, aes(x = dbh, y = prob_mortality)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Effect of Tree Size", 
       x = "DBH (cm)", 
       y = "P(Mortality)") +
  ylim(0, 1) + 
  theme_minimal()

p2 <- ggplot(drought_grid, aes(x = drought_severity, y = prob_mortality)) +
  geom_line(color = "red", size = 1.2) +
  labs(title = "Effect of Drought Severity", 
       x = "Drought Severity Index", 
       y = "P(Mortality)") +
  ylim(0, 1) + 
  theme_minimal()

p3 <- ggplot(canopy_grid, aes(x = canopy_position, y = prob_mortality, 
                              fill = canopy_position)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_d() +
  labs(title = "Effect of Canopy Position", 
       x = "Canopy Position", 
       y = "P(Mortality)") +
  ylim(0, 1) + 
  theme_minimal() + 
  theme(legend.position = "none")

grid.arrange(p1, p2, p3, ncol = 2)

# 7. Model comparison
# Try models with interactions or polynomial terms
model_dbh_drought <- glm(mortality ~ dbh * drought_severity + canopy_position + soil_depth,
                         family = binomial, data = tree_data)

model_reduced <- glm(mortality ~ dbh + canopy_position + drought_severity,
                    family = binomial, data = tree_data)

# Compare AICs
aic_comparison <- AIC(mortality_model, model_dbh_drought, model_reduced)
aic_comparison$delta_AIC <- aic_comparison$AIC - min(aic_comparison$AIC)
print(aic_comparison)

# Likelihood ratio test for interaction
anova(mortality_model, model_dbh_drought, test = "Chisq")

3.16 Key Takeaways

  • Binary responses require logistic regression, not linear regression
  • The logit link function constrains predictions to [0,1] and models log-odds
  • Odds and odds ratios are the natural scale for interpreting logistic regression
  • Coefficients represent effects on the log-odds scale; exponentiated coefficients are odds ratios
  • Odds ratios ≠ relative risk (except when outcomes are rare, < 10%)
  • Effects on probability are non-linear and depend on baseline probability
  • Model diagnostics include deviance, residual plots, ROC curves, and AUC
  • Complete separation causes estimation problems and requires special handling
  • Sample size matters: aim for at least 10-20 events per predictor variable
  • Logistic regression provides the foundation for all subsequent GLM models

3.17 Looking Ahead

In upcoming chapters, we’ll extend the GLM framework to other data types:

Chapter 5: Poisson Regression: - For count data (non-negative integers) - Uses log link instead of logit - Rate ratios analogous to odds ratios

Chapter 6: Negative Binomial Regression: - Handles overdispersed count data - Adds dispersion parameter like we added flexibility for binary data

Chapter 7: Zero-Inflated Models: - Combines logistic regression (for zeros) with count models - Two-part models for excess zeros

Chapter 8: Beta Regression: - For continuous proportions [0,1] - Uses logit link like logistic regression - Continuous response instead of binary

The concepts you’ve mastered here - link functions, likelihood-based inference, odds ratios, and model comparison - will appear throughout the course!

3.18 Additional Resources

  1. Agresti, A. (2018). An Introduction to Categorical Data Analysis (3rd ed.). Wiley.

  2. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.

  3. Harrell, F. E. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd ed.). Springer.

  4. McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). CRC Press. [Chapter 11 on GLMs]

  5. Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

  6. Jaeger, B. C., Edwards, L. J., Das, K., & Sen, P. K. (2017). An R² statistic for fixed effects in the generalized linear mixed model. Journal of Applied Statistics, 44(6), 1086-1105.

  7. UCLA Statistical Consulting: https://stats.oarc.ucla.edu/r/dae/logit-regression/