Chapter 1: Statistical Foundations for Generalized Linear Models
Author
Steffen Foerster
1.1 Introduction: Building a Statistical Toolkit
Welcome to the world of statistical modeling for ecological data! Before we dive into Generalized Linear Models (GLMs) in the next chapter, we need to make sure our statistical foundations are solid. Think of this chapter as checking your equipment before a field expedition - we want to ensure all the essential tools are ready and you know how to use them.
In this chapter, we’ll work through the core statistical concepts that underpin everything we’ll do in this course. We’ll use data from penguin populations in Antarctica to make these concepts concrete and biologically meaningful. These aren’t just abstract mathematical ideas - they’re practical tools for understanding ecological patterns and making scientific inferences.
About the Palmer Penguins dataset
Throughout this chapter, we’ll use the famous Palmer Penguins dataset collected by Dr. Kristen Gorman at Palmer Station, Antarctica. This dataset includes measurements from three penguin species (Adelie, Gentoo, and Chinstrap) across three islands in the Palmer Archipelago.
Variables we’ll use: - species: Penguin species (Adelie, Gentoo, Chinstrap) - island: Island where observed (Torgersen, Biscoe, Dream) - bill_length_mm: Bill length in millimeters - bill_depth_mm: Bill depth in millimeters - flipper_length_mm: Flipper length in millimeters - body_mass_g: Body mass in grams - sex: Penguin sex (male, female) - year: Year of observation (2007-2009)
Credit: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081.
Let’s load the packages and data we’ll need:
# Load required packageslibrary(tidyverse)library(gridExtra)# Set global theme with larger texttheme_set(theme_minimal(base_size =16))# Load the penguin datapenguins <-read.csv("data/penguins.csv")# Take a first lookstr(penguins)
'data.frame': 344 obs. of 8 variables:
$ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : chr "male" "female" "female" NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# Summary of the datasetsummary(penguins)
species island bill_length_mm bill_depth_mm
Length:344 Length:344 Min. :32.10 Min. :13.10
Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60
Mode :character Mode :character Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 Length:344 Min. :2007
1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
Median :197.0 Median :4050 Mode :character Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
# Check for missing valuescolSums(is.na(penguins))
species island bill_length_mm bill_depth_mm
0 0 2 2
flipper_length_mm body_mass_g sex year
2 2 11 0
PART 1: STATISTICAL FOUNDATIONS
This section covers the conceptual foundations of statistical inference and would typically constitute the first lecture.
1.2 Probability Distributions: The Language of Variation
In ecology, variation is everywhere. No two penguins are exactly alike, no two populations respond identically to environmental change, and no two field seasons give us identical data. Probability distributions give us a mathematical language to describe and work with this variation.
1.2.1 What is a Probability Distribution?
A probability distribution describes how frequently different values occur in a dataset or population. It tells us: - Which values are most likely - Which values are rare - The overall pattern of variation
Key properties of probability distributions
For any probability distribution:
The probability of each possible value is between 0 and 1 (0% to 100%)
The sum of all probabilities equals 1 (total probability must be 100%)
For continuous distributions, we think about probability as the area under a curve
These aren’t just mathematical niceties - they reflect fundamental logic: something must happen (probabilities sum to 1), and impossible things don’t happen (probabilities can’t be negative or greater than 1).
1.2.2 Discrete vs. Continuous Distributions
Discrete distributions describe variables that can only take specific values: - Number of offspring (0, 1, 2, 3, …) - Presence/absence (0 or 1) - Count of individuals in a quadrat
Continuous distributions describe variables that can take any value within a range: - Body mass (3427.5 g, 3427.501 g, 3427.502 g, …) - Temperature (-15.3°C, -15.31°C, …) - Flipper length (190.5 mm, 190.51 mm, …)
Let’s visualize this distinction using our penguin data:
Code
# Create a discrete variable: number of penguins per island per yearisland_counts <- penguins %>%filter(!is.na(species)) %>%group_by(island, year) %>%summarize(count =n(), .groups ="drop")# Plot 1: Discrete - counts of penguinsp1 <-ggplot(island_counts, aes(x = count)) +geom_histogram(binwidth =1, fill ="steelblue", color ="black") +labs(title ="Discrete Variable",subtitle ="Count of penguins (only whole numbers possible)",x ="Number of Penguins",y ="Frequency") +scale_x_continuous(breaks =seq(0, 60, by =10))# Plot 2: Continuous - body massp2 <- penguins %>%filter(!is.na(body_mass_g)) %>%ggplot(aes(x = body_mass_g)) +geom_histogram(bins =30, fill ="coral", color ="black") +labs(title ="Continuous Variable",subtitle ="Body mass (any value within range possible)",x ="Body Mass (g)",y ="Frequency")grid.arrange(p1, p2, ncol =2)
Discrete (left) vs. continuous (right) variables in penguin data
1.2.3 The Normal Distribution: A Fundamental Pattern
The normal distribution (also called the Gaussian distribution) is perhaps the most important distribution in statistics. Many biological measurements are approximately normally distributed, and even when they’re not, the normal distribution plays a crucial role in statistical inference.
# Focus on Adelie penguins for clarityadelie <- penguins %>%filter(species =="Adelie", !is.na(body_mass_g))# Calculate mean and SD for referenceadelie_mean <-mean(adelie$body_mass_g)adelie_sd <-sd(adelie$body_mass_g)# Plot histogram with normal curve overlayggplot(adelie, aes(x = body_mass_g)) +geom_histogram(aes(y =after_stat(density)), bins =25, fill ="skyblue", color ="black", alpha =0.7) +stat_function(fun = dnorm, args =list(mean = adelie_mean, sd = adelie_sd),color ="darkred", linewidth =1.2) +geom_vline(xintercept = adelie_mean, linetype ="dashed", color ="darkred", linewidth =1) +labs(title ="Adelie Penguin Body Mass Distribution",subtitle =paste0("Mean = ", round(adelie_mean, 0), " g, SD = ", round(adelie_sd, 0), " g"),x ="Body Mass (g)",y ="Density") +annotate("text", x = adelie_mean +200, y =0.0008,label ="Mean", color ="darkred", size =5)
Body mass distribution for Adelie penguins
The 68-95-99.7 rule
For normally distributed data: - Approximately 68% of observations fall within 1 standard deviation (SD) of the mean - Approximately 95% fall within 2 SD of the mean - Approximately 99.7% fall within 3 SD of the mean
This is sometimes called the “empirical rule” and is extremely useful for quick assessments of whether observations are unusual.
For our Adelie penguins: - Mean = 3701 g - SD = 459 g - 95% of penguins weigh between 2784 g and 4618 g
1.2.4 Other Important Distributions
While we’ll focus on the normal distribution for now, ecological data come in many forms. Here’s a preview of distributions we’ll encounter in later chapters:
Code
set.seed(123)# Create example data for different distributionsx_norm <-seq(-4, 4, length.out =100)y_norm <-dnorm(x_norm, 0, 1)x_count <-0:15y_pois <-dpois(x_count, lambda =5)y_binom <-dbinom(x_count, size =15, prob =0.5)x_continuous <-seq(0.01, 10, length.out =100)y_gamma <-dgamma(x_continuous, shape =2, rate =0.5)# Plot 1: Normalp1 <-ggplot(data.frame(x = x_norm, y = y_norm), aes(x, y)) +geom_line(color ="blue", linewidth =1.2) +labs(title ="Normal Distribution",subtitle ="Continuous data, symmetric\n(body mass, temperature)",x ="Value", y ="Density") +theme(plot.subtitle =element_text(size =12))# Plot 2: Poissonp2 <-ggplot(data.frame(x = x_count, y = y_pois), aes(x, y)) +geom_bar(stat ="identity", fill ="darkgreen", color ="black") +labs(title ="Poisson Distribution",subtitle ="Count data, no upper limit\n(number of eggs, number seen)",x ="Count", y ="Probability") +theme(plot.subtitle =element_text(size =12))# Plot 3: Binomialp3 <-ggplot(data.frame(x = x_count, y = y_binom), aes(x, y)) +geom_bar(stat ="identity", fill ="coral", color ="black") +labs(title ="Binomial Distribution",subtitle ="Count data with upper limit\n(successes out of trials)",x ="Number of Successes", y ="Probability") +theme(plot.subtitle =element_text(size =12))# Plot 4: Gammap4 <-ggplot(data.frame(x = x_continuous, y = y_gamma), aes(x, y)) +geom_line(color ="purple", linewidth =1.2) +labs(title ="Gamma Distribution",subtitle ="Continuous positive data\n(waiting times, survival times)",x ="Value", y ="Density") +theme(plot.subtitle =element_text(size =12))grid.arrange(p1, p2, p3, p4, ncol =2)
Common probability distributions in ecology
Why distributions matter for modeling
The choice of probability distribution isn’t arbitrary - it should match your data type:
Continuous, unbounded data (can be positive or negative): Normal distribution
Continuous, positive-only data: Gamma or log-normal distribution
Count data: Poisson or negative binomial distribution
Binary data (success/failure): Binomial distribution
Proportions (between 0 and 1): Beta distribution
We’ll learn how to work with each of these in upcoming chapters. For now, recognize that choosing the right distribution is the first step in good statistical modeling.
1.3 Parameters and Estimates: Sample vs. Population
One of the most fundamental concepts in statistics is the distinction between populations and samples.
Population: The complete set of all individuals or observations we’re interested in
Example: All Adelie penguins in Antarctica
Parameters (true values): μ (mu, population mean), σ (sigma, population SD)
Sample: A subset of the population that we actually observe
Example: The 152 Adelie penguins in our dataset
Statistics (estimated values): x̄ (x-bar, sample mean), s (sample SD)
A crucial distinction
We almost never know the true population parameters. We use sample statistics to estimate them, and there’s always uncertainty in these estimates.
This uncertainty is not a flaw in our methods - it’s an inherent feature of working with samples. Good statistics is about quantifying and accounting for this uncertainty.
Let’s see this in action with our penguin data:
# Calculate statistics for each speciesspecies_stats <- penguins %>%filter(!is.na(body_mass_g), !is.na(species)) %>%group_by(species) %>%summarize(n =n(),mean_mass =mean(body_mass_g),sd_mass =sd(body_mass_g),se_mass = sd_mass /sqrt(n) )print(species_stats)
# Visualize with error barsggplot(species_stats, aes(x = species, y = mean_mass, fill = species)) +geom_col(alpha =0.7) +geom_errorbar(aes(ymin = mean_mass - sd_mass, ymax = mean_mass + sd_mass),width =0.2, linewidth =1) +scale_fill_viridis_d(option ="plasma", end =0.8) +labs(title ="Mean Body Mass by Penguin Species",subtitle ="Error bars show ± 1 standard deviation",x ="Species",y ="Body Mass (g)") +theme(legend.position ="none")
Sample statistics from different penguin species
Notice how each species has a different mean and standard deviation. These are sample statistics - our best estimates of the true population parameters based on the penguins we measured.
1.4 The Central Limit Theorem: Why the Normal Distribution is Special
The Central Limit Theorem (CLT) is one of the most powerful and surprising results in statistics. It tells us something remarkable:
If you take many samples from any population (regardless of its distribution) and calculate the mean of each sample, those sample means will be approximately normally distributed.
This works even if the original population is not normally distributed!
1.4.1 Visualizing the Central Limit Theorem
Let’s demonstrate this with our penguin data:
Code
set.seed(456)# Get body mass data (removing NAs)all_mass <- penguins$body_mass_g[!is.na(penguins$body_mass_g)]# Population parameterspop_mean <-mean(all_mass)pop_sd <-sd(all_mass)# Function to draw samples and calculate meanssample_means_n5 <-replicate(1000, mean(sample(all_mass, 5)))sample_means_n30 <-replicate(1000, mean(sample(all_mass, 30)))sample_means_n100 <-replicate(1000, mean(sample(all_mass, 100)))# Create data frame for plottingclt_data <-data.frame(mean =c(sample_means_n5, sample_means_n30, sample_means_n100),sample_size =rep(c("n = 5", "n = 30", "n = 100"), each =1000))# Calculate theoretical SD of sampling distributionse_n5 <- pop_sd /sqrt(5)se_n30 <- pop_sd /sqrt(30)se_n100 <- pop_sd /sqrt(100)# Plotggplot(clt_data, aes(x = mean)) +geom_histogram(aes(y =after_stat(density)), bins =30, fill ="steelblue", color ="black", alpha =0.7) +geom_vline(xintercept = pop_mean, color ="red", linetype ="dashed", linewidth =1) +facet_wrap(~sample_size, ncol =1, scales ="free_y") +labs(title ="Central Limit Theorem in Action",subtitle =paste0("Distribution of sample means from 1000 samples (Population mean = ", round(pop_mean, 0), " g)"),x ="Sample Mean Body Mass (g)",y ="Density") +theme(strip.text =element_text(size =14))
Demonstration of the Central Limit Theorem
What this shows
Notice that as sample size increases:
The distribution of sample means becomes more normally distributed
The spread (variability) of sample means decreases
The center stays at the population mean
The standard deviation of these sample means is called the standard error and equals:
\[SE = \frac{\sigma}{\sqrt{n}}\]
Where σ is the population standard deviation and n is the sample size.
For our example with n = 30: - Theoretical SE = 146.4 g - This matches what we see in the middle panel!
1.4.2 Implications of the Central Limit Theorem
The CLT has profound implications:
We can use the normal distribution for many inference procedures, even when our data aren’t normally distributed
Larger samples give us more precise estimates (smaller standard error)
We can quantify our uncertainty about population parameters using the normal distribution
This is why so much of classical statistics relies on the normal distribution - not because data are always normal, but because sample means are approximately normal!
1.5 Standard Errors and Confidence Intervals
We’ve seen that sample means vary from sample to sample. The standard error (SE) quantifies this variability and is our key to understanding uncertainty in estimates.
1.5.1 Standard Error
The standard error of the mean tells us how much sample means typically vary from the true population mean:
\[SE = \frac{s}{\sqrt{n}}\]
Where: - s is the sample standard deviation - n is the sample size
# Calculate standard error for each speciesspecies_uncertainty <- penguins %>%filter(!is.na(body_mass_g), !is.na(species)) %>%group_by(species) %>%summarize(n =n(),mean_mass =mean(body_mass_g),sd_mass =sd(body_mass_g),se_mass = sd_mass /sqrt(n),.groups ="drop" )print(species_uncertainty)
Notice that standard errors are much smaller than standard deviations. The SE tells us about uncertainty in our estimate of the mean, while the SD tells us about variability in individual observations.
1.5.2 Confidence Intervals
A confidence interval gives us a range of plausible values for a population parameter. The most common is the 95% confidence interval:
\[CI_{95\%} = \bar{x} \pm 1.96 \times SE\]
(We use 1.96 because that’s approximately 2 standard deviations in a normal distribution)
# Visualizeggplot(species_ci, aes(x = species, y = mean_mass, color = species)) +geom_point(size =4) +geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),width =0.2, linewidth =1.5) +scale_color_viridis_d(option ="plasma", end =0.8) +labs(title ="Mean Body Mass with 95% Confidence Intervals",subtitle ="Error bars show the range of plausible values for the true population mean",x ="Species",y ="Body Mass (g)") +theme(legend.position ="none")
95% confidence intervals for mean body mass
The CORRECT interpretation of confidence intervals
A 95% confidence interval means:
“If we repeated this study many times and calculated a 95% CI each time, approximately 95% of those intervals would contain the true population mean.”
It does NOT mean:
❌ “There’s a 95% probability the true mean is in this interval”
❌ “95% of individuals have values in this range”
The interval either contains the true mean or it doesn’t - we just don’t know which! But our method for constructing the interval is right 95% of the time in the long run.
Let’s visualize what this means:
Code
set.seed(789)# True populationpopulation <- penguins %>%filter(species =="Adelie", !is.na(body_mass_g)) %>%pull(body_mass_g)true_mean <-mean(population)# Take 20 samples and calculate CIsn_samples <-20sample_size <-30ci_samples <-replicate(n_samples, { samp <-sample(population, sample_size, replace =TRUE) m <-mean(samp) se <-sd(samp) /sqrt(sample_size)data.frame(mean = m,lower = m -1.96* se,upper = m +1.96* se )}, simplify =FALSE)ci_df <-bind_rows(ci_samples, .id ="sample") %>%mutate(sample =as.numeric(sample),contains_true = lower <= true_mean & upper >= true_mean )# Plotggplot(ci_df, aes(x = sample, y = mean, color = contains_true)) +geom_hline(yintercept = true_mean, linetype ="dashed", color ="red", linewidth =1) +geom_point(size =3) +geom_errorbar(aes(ymin = lower, ymax = upper), width =0.3, linewidth =1) +scale_color_manual(values =c("FALSE"="red", "TRUE"="steelblue"),labels =c("Misses true mean", "Contains true mean"),name ="") +coord_flip() +labs(title ="20 Different Samples from the Same Population",subtitle =paste0("True population mean = ", round(true_mean, 0), " g. Most CIs (", sum(ci_df$contains_true), "/20 = ",round(100*sum(ci_df$contains_true)/20, 0), "%) contain it."),x ="Sample Number",y ="Body Mass (g)") +theme(legend.position ="bottom")
Confidence interval interpretation: 20 samples from the same population
This visualization shows exactly what “95% confidence” means: in repeated sampling, about 95% of confidence intervals capture the true population parameter.
1.6 Hypothesis Testing: Logic and Limitations
Hypothesis testing is a framework for making decisions based on data. It’s deeply embedded in scientific practice, but it’s also widely misunderstood and misused. Let’s understand both what it does and what it doesn’t do.
1.6.1 The Logic of Hypothesis Testing
The null hypothesis significance testing (NHST) framework follows this logic:
Assume a null hypothesis (H₀): Usually “no effect” or “no difference”
Calculate a test statistic: Summarizes how far our data deviate from H₀
Determine probability: If H₀ were true, what’s the probability of seeing data this extreme?
Make a decision: If that probability (p-value) is very small, reject H₀
Let’s test whether male and female Adelie penguins differ in body mass:
# Filter to Adelie penguins with known sexadelie_sex <- penguins %>%filter(species =="Adelie", !is.na(sex), !is.na(body_mass_g))# Summary statisticsadelie_sex_summary <- adelie_sex %>%group_by(sex) %>%summarize(n =n(),mean_mass =mean(body_mass_g),sd_mass =sd(body_mass_g),.groups ="drop" )print(adelie_sex_summary)
# A tibble: 2 × 4
sex n mean_mass sd_mass
<chr> <int> <dbl> <dbl>
1 female 73 3369. 269.
2 male 73 4043. 347.
# Conduct t-testt_result <-t.test(body_mass_g ~ sex, data = adelie_sex)print(t_result)
Welch Two Sample t-test
data: body_mass_g by sex
t = -13.126, df = 135.69, p-value < 2.2e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-776.3012 -573.0139
sample estimates:
mean in group female mean in group male
3368.836 4043.493
ggplot(adelie_sex, aes(x = sex, y = body_mass_g, fill = sex)) +geom_boxplot(alpha =0.7, outlier.shape =NA) +geom_jitter(width =0.2, alpha =0.3, size =2) +stat_summary(fun = mean, geom ="point", shape =23, size =4, fill ="red", color ="darkred") +scale_fill_manual(values =c("female"="coral", "male"="steelblue")) +labs(title ="Body Mass by Sex in Adelie Penguins",subtitle =paste0("Difference = ", round(diff(adelie_sex_summary$mean_mass), 0)," g, p < 0.001"),x ="Sex",y ="Body Mass (g)") +theme(legend.position ="none")
Body mass distribution by sex in Adelie penguins
Interpreting the results
The t-test tells us: - Mean difference: 675 g - p-value < 0.001
This means: If male and female Adelie penguins actually had the same mean body mass (H₀), the probability of observing a difference as large as we did (or larger) is less than 0.1%.
Since this probability is very small, we reject H₀ and conclude that male and female Adelie penguins differ in body mass.
1.6.2 Type I and Type II Errors
When we make decisions based on hypothesis tests, two types of errors are possible:
X.
H..is.TRUE..no.real.effect.
H..is.FALSE..real.effect.exists.
We conclude no effect
✓ Correct decision
❌ Type II Error (β)
We conclude there IS an effect
❌ Type I Error (α = 0.05)
✓ Correct decision (Power = 1-β)
Type I Error (False Positive): Concluding there’s an effect when there isn’t
Probability = α (significance level, usually 0.05)
Type II Error (False Negative): Failing to detect a real effect
Probability = β (depends on sample size and effect size)
Statistical Power: Probability of detecting a real effect = 1 - β
Higher power is better! Achieve this with larger samples or larger effects
The multiple testing problem
If we conduct 20 independent hypothesis tests at α = 0.05, we expect about 1 false positive even if all null hypotheses are true (20 × 0.05 = 1).
This is why running many tests and reporting only the “significant” ones is problematic - some of those significant results are likely false positives!
Solutions include: - Pre-registration of hypotheses - Corrections for multiple testing (e.g., Bonferroni correction) - Focus on effect sizes and confidence intervals rather than just p-values
1.6.3 What P-Values Do (and Don’t) Tell Us
This is perhaps the most important section in the chapter. P-values are widely misunderstood, and this leads to serious problems in scientific inference.
What a p-value IS:
The probability of obtaining data as extreme as (or more extreme than) what we observed, assuming the null hypothesis is true.
In symbols: P(data or more extreme | H₀ is true)
What a p-value is NOT:
❌ The probability that the null hypothesis is true
❌ The probability that the alternative hypothesis is true
❌ The probability that your result is due to chance
❌ The probability that you’ve made a mistake
Cohen’s critique: “The Earth is Round (p < 0.05)”
In his famous 1994 paper, psychologist Jacob Cohen devastatingly critiqued null hypothesis significance testing. His key points:
The null hypothesis is usually false anyway: In ecology, do we really believe that two populations have exactly identical means, down to infinite decimal places? Of course not! Given enough data, we’d reject almost any null hypothesis.
P-values don’t tell us what we want to know: We want to know “Given my data, what’s the probability that H₀ is true?” But p-values tell us “Given that H₀ is true, what’s the probability of my data?”
The logic is backwards: Consider this analogy:
NHST says: “If a person is American, they’re probably not a member of Congress. This person is a member of Congress. Therefore, they’re probably not American.”
That’s obviously wrong! But it’s the same logical structure as NHST.
Statistical significance ≠ scientific importance: A tiny, trivial effect can be “statistically significant” with a large enough sample. Conversely, an important effect might not be “significant” with a small sample.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003.
Code
set.seed(111)# Simulate two scenarios: large effect small n, small effect large n# Scenario 1: Large effect, small samplen_small <-15group1_small <-rnorm(n_small, mean =100, sd =15)group2_small <-rnorm(n_small, mean =120, sd =15) # 20-unit difference# Scenario 2: Small effect, large samplen_large <-200group1_large <-rnorm(n_large, mean =100, sd =15)group2_large <-rnorm(n_large, mean =105, sd =15) # 5-unit difference# Conduct t-teststest1 <-t.test(group1_small, group2_small)test2 <-t.test(group1_large, group2_large)# Create data frames for plottingsim_data <-bind_rows(data.frame(value =c(group1_small, group2_small),group =rep(c("Group A", "Group B"), each = n_small),scenario ="Large effect, small sample\n(may not be 'significant')"),data.frame(value =c(group1_large, group2_large),group =rep(c("Group A", "Group B"), each = n_large),scenario ="Small effect, large sample\n(may be 'significant')"))# Plotggplot(sim_data, aes(x = group, y = value, fill = group)) +geom_boxplot(alpha =0.7) +facet_wrap(~scenario) +scale_fill_manual(values =c("Group A"="lightblue", "Group B"="lightcoral")) +labs(title ="Effect Size vs. Statistical Significance",subtitle =paste0("Left: p = ", round(test1$p.value, 3), " | Right: p = ", round(test2$p.value, 3)),x ="Group",y ="Value") +theme(legend.position ="none",strip.text =element_text(size =12))
Sample size affects p-values, not just effect size
1.6.4 Better Practices for Statistical Inference
Given these problems with NHST, what should we do? Here are some recommendations that we’ll follow throughout this course:
Report effect sizes and confidence intervals, not just p-values
“Males are 668 g heavier (95% CI: 581-755 g)” is more informative than “p < 0.001”
Consider biological/ecological significance, not just statistical significance
Is the effect size large enough to matter in the real world?
Use model comparison approaches (coming in later chapters)
Instead of testing single hypotheses, compare multiple plausible models
Use information criteria (AIC) rather than p-values
Estimate effect sizes with uncertainty, rather than making binary decisions
“The effect is probably between X and Y” is often more useful than “reject/fail to reject”
Pre-register hypotheses and analysis plans when possible
Reduces the temptation to “p-hack” or “HARKing” (Hypothesizing After Results are Known)
Looking ahead
In Chapter 8 (Model Selection), we’ll learn about information-theoretic approaches that avoid many of the pitfalls of NHST. These methods compare multiple candidate models and quantify the evidence for each, rather than making binary reject/accept decisions.
For now, just remember: p-values are not measures of evidence, and statistical significance is not the goal of data analysis!
PART 2: LINEAR MODELS FOUNDATION
This section builds on Part 1 to introduce linear models as the foundation for GLMs, and would typically constitute the second lecture.
1.7 Introduction to Linear Models
Now that we have our statistical foundations in place, let’s build models! Linear models are the foundation for everything we’ll do in this course. Generalized Linear Models (the focus of Chapter 2 onwards) are extensions of linear models, so understanding linear models deeply is essential.
1.7.1 What is a Linear Model?
A linear model expresses a response variable (Y) as a linear combination of predictor variables (X₁, X₂, …) plus random error:
Where: - Y is the response variable (what we’re trying to predict or explain) - X₁, X₂, …, Xₚ are predictor variables (what we’re using to explain Y) - β₀ is the intercept (expected value of Y when all X’s are zero) - β₁, β₂, …, βₚ are slopes (effect of each X on Y) - ε (epsilon) is random error
The term “linear” refers to the fact that Y is a linear combination of the predictors - the equation is a straight line (or hyperplane in multiple dimensions).
Linear models in ecology
Linear models are incredibly flexible and can answer questions like:
How does body mass vary with flipper length?
Do different species have different body masses?
Does the relationship between body mass and flipper length differ among species?
How do multiple factors jointly affect body mass?
The basic framework is the same for all of these - only the predictors change!
1.7.2 A Simple Linear Model: Body Mass vs. Flipper Length
Let’s start with the simplest case: one continuous predictor.
Research question: Does flipper length predict body mass in Adelie penguins?
# Filter to complete casesadelie_complete <- penguins %>%filter(species =="Adelie", !is.na(flipper_length_mm), !is.na(body_mass_g))# Fit the modelmodel1 <-lm(body_mass_g ~ flipper_length_mm, data = adelie_complete)# Display resultssummary(model1)
Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = adelie_complete)
Residuals:
Min 1Q Median 3Q Max
-875.68 -331.10 -14.53 265.74 1144.81
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2535.837 964.798 -2.628 0.00948 **
flipper_length_mm 32.832 5.076 6.468 1.34e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 406.6 on 149 degrees of freedom
Multiple R-squared: 0.2192, Adjusted R-squared: 0.214
F-statistic: 41.83 on 1 and 149 DF, p-value: 1.343e-09
# Visualizeggplot(adelie_complete, aes(x = flipper_length_mm, y = body_mass_g)) +geom_point(alpha =0.6, size =2) +geom_smooth(method ="lm", se =TRUE, color ="blue", fill ="lightblue") +labs(title ="Body Mass vs. Flipper Length in Adelie Penguins",subtitle =paste0("Body Mass = ", round(coef(model1)[1], 0), " + ", round(coef(model1)[2], 1), " × Flipper Length"),x ="Flipper Length (mm)",y ="Body Mass (g)")
`geom_smooth()` using formula = 'y ~ x'
Relationship between flipper length and body mass
Interpreting the coefficients
From our model output:
Intercept (β₀) = -2536 g - This is the predicted body mass when flipper length = 0 mm - This doesn’t make biological sense! (Can’t have a penguin with 0 mm flippers) - The intercept is often not directly interpretable - that’s okay
Slope (β₁) = 32.8 g/mm - For every 1 mm increase in flipper length, body mass increases by 32.8 g - This IS directly interpretable and biologically meaningful - The p-value (< 0.001) indicates strong evidence against the null hypothesis of no relationship
R² = 0.219 - Approximately 22% of variation in body mass is explained by flipper length - The remaining ~78% is due to other factors (sex, individual variation, measurement error, etc.)
1.7.3 Adding a Categorical Predictor: Species Differences
Now let’s compare body mass across all three penguin species:
# Use all speciespenguins_complete <- penguins %>%filter(!is.na(body_mass_g), !is.na(species))# Fit model with species as predictormodel2 <-lm(body_mass_g ~ species, data = penguins_complete)summary(model2)
Call:
lm(formula = body_mass_g ~ species, data = penguins_complete)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3700.66 37.62 98.37 <2e-16 ***
speciesChinstrap 32.43 67.51 0.48 0.631
speciesGentoo 1375.35 56.15 24.50 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
# Calculate means for plottingspecies_means <- penguins_complete %>%group_by(species) %>%summarize(mean_mass =mean(body_mass_g), .groups ="drop")ggplot(penguins_complete, aes(x = species, y = body_mass_g, fill = species)) +geom_boxplot(alpha =0.7, outlier.shape =NA) +geom_jitter(width =0.2, alpha =0.2, size =1) +geom_point(data = species_means, aes(y = mean_mass), shape =23, size =4, fill ="red", color ="darkred") +scale_fill_viridis_d(option ="plasma", end =0.8) +labs(title ="Body Mass by Species",subtitle ="Diamonds show species means",x ="Species",y ="Body Mass (g)") +theme(legend.position ="none")
Body mass varies significantly among species
How R handles categorical predictors
When you include a categorical variable as a predictor, R automatically creates dummy variables:
From the model output: - Intercept = 3701 g → Mean body mass for Adelie (reference category) - speciesChinstrap = 32 g → Difference between Chinstrap and Adelie - speciesGentoo = 1375 g → Difference between Gentoo and Adelie
So: - Predicted Adelie mass = 3701 g - Predicted Chinstrap mass = 3701 + 32 = 3733 g - Predicted Gentoo mass = 3701 + 1375 = 5076 g
R chooses the reference category alphabetically by default (Adelie comes first).
1.8 Multiple Predictors: Additive Models
Real ecological patterns usually involve multiple factors. Let’s build a model with both flipper length and species:
# Fit additive modelmodel3 <-lm(body_mass_g ~ flipper_length_mm + species, data = penguins_complete)summary(model3)
Call:
lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins_complete)
Residuals:
Min 1Q Median 3Q Max
-927.70 -254.82 -23.92 241.16 1191.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
speciesChinstrap -206.510 57.731 -3.577 0.000398 ***
speciesGentoo 266.810 95.264 2.801 0.005392 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 375.5 on 338 degrees of freedom
Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
# Create predictionspred_data <-expand.grid(flipper_length_mm =seq(min(penguins_complete$flipper_length_mm, na.rm =TRUE),max(penguins_complete$flipper_length_mm, na.rm =TRUE),length.out =100),species =unique(penguins_complete$species))pred_data$predicted <-predict(model3, newdata = pred_data)# Plotggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +geom_point(alpha =0.4) +geom_line(data = pred_data, aes(y = predicted), linewidth =1.2) +scale_color_viridis_d(option ="plasma", end =0.8) +labs(title ="Additive Model: Body Mass ~ Flipper Length + Species",subtitle ="Parallel lines indicate same slope for all species",x ="Flipper Length (mm)",y ="Body Mass (g)",color ="Species")
Additive model: parallel regression lines for each species
Understanding additive effects
In an additive model, predictors have independent effects:
Flipper length effect (β₁) = 40.7 g/mm - For any species, a 1 mm increase in flipper length predicts a 40.7 g increase in body mass - This is the same slope for all species (parallel lines)
Species effects: - Chinstrap vs. Adelie: -207 g (after accounting for flipper length) - Gentoo vs. Adelie: 267 g (after accounting for flipper length)
Notice that after accounting for flipper length, Chinstrap and Adelie penguins have similar body masses (coefficient is not significant), but Gentoo penguins are still much heavier.
This makes biological sense: Gentoo penguins are genuinely larger birds, not just longer-flippered!
1.9 Interactions: When Effects Depend on Context
Sometimes the effect of one variable depends on the value of another. This is called an interaction.
Research question: Does the relationship between flipper length and body mass differ among species?
# Fit interaction modelmodel4 <-lm(body_mass_g ~ flipper_length_mm * species, data = penguins_complete)summary(model4)
Call:
lm(formula = body_mass_g ~ flipper_length_mm * species, data = penguins_complete)
Residuals:
Min 1Q Median 3Q Max
-911.18 -251.93 -31.77 197.82 1144.81
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2535.837 879.468 -2.883 0.00419 **
flipper_length_mm 32.832 4.627 7.095 7.69e-12 ***
speciesChinstrap -501.359 1523.459 -0.329 0.74229
speciesGentoo -4251.444 1427.332 -2.979 0.00311 **
flipper_length_mm:speciesChinstrap 1.742 7.856 0.222 0.82467
flipper_length_mm:speciesGentoo 21.791 6.941 3.139 0.00184 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 370.6 on 336 degrees of freedom
Multiple R-squared: 0.7896, Adjusted R-squared: 0.7864
F-statistic: 252.2 on 5 and 336 DF, p-value: < 2.2e-16
# Create predictionspred_data_int <-expand.grid(flipper_length_mm =seq(min(penguins_complete$flipper_length_mm, na.rm =TRUE),max(penguins_complete$flipper_length_mm, na.rm =TRUE),length.out =100),species =unique(penguins_complete$species))pred_data_int$predicted <-predict(model4, newdata = pred_data_int)# Plotggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +geom_point(alpha =0.4) +geom_line(data = pred_data_int, aes(y = predicted), linewidth =1.2) +scale_color_viridis_d(option ="plasma", end =0.8) +labs(title ="Interaction Model: Body Mass ~ Flipper Length × Species",subtitle ="Non-parallel lines indicate different slopes for each species",x ="Flipper Length (mm)",y ="Body Mass (g)",color ="Species")
Interaction model: non-parallel regression lines
Interpreting interactions
The interaction model includes these terms:
Main effects (same as before)
Interaction terms: flipper_length_mm:speciesChinstrap and flipper_length_mm:speciesGentoo
The interaction coefficients tell us how much the slope differs from the reference species (Adelie).
Slope for Adelie = 32.8 g/mm
Slope for Chinstrap = 32.8 + 1.7 = 34.6 g/mm
Slope for Gentoo = 32.8 + 21.8 = 54.6 g/mm
The interaction for Gentoo is significant (p < 0.001), indicating that the body mass-flipper length relationship is indeed different for Gentoo penguins compared to Adelie penguins.
1.9.1 When to Include Interactions
Interactions should be included when you have good reason to believe that one variable’s effect depends on another:
Include interactions when: - There’s biological reason to expect different slopes (e.g., different species may respond differently to temperature) - Exploratory plots suggest non-parallel patterns - The interaction is theoretically important to your question
Be cautious about: - Including many interactions (models become complex and hard to interpret) - Interpreting main effects when interactions are present (main effects are conditional on other variables = 0) - Overfitting with small sample sizes
1.10 Model Assumptions and Diagnostics
Linear models make several assumptions. Violating these assumptions can lead to incorrect inferences.
1.10.1 The Four Key Assumptions
Linearity: The relationship between predictors and response is linear
Independence: Observations are independent of each other
Homoscedasticity: Variance of residuals is constant across fitted values
Normality: Residuals are approximately normally distributed
1. Residuals vs. Fitted (top-left): - Check for non-linearity and homoscedasticity - Should see random scatter around horizontal line at 0 - Fan-shaped pattern indicates heteroscedasticity - Curved pattern indicates non-linearity
2. Normal Q-Q (top-right): - Check for normality of residuals - Points should fall along diagonal line - Systematic deviations indicate non-normality
3. Scale-Location (bottom-left): - Another check for homoscedasticity - Should see random scatter with roughly horizontal red line - Increasing or decreasing trend indicates heteroscedasticity
4. Residuals vs. Leverage (bottom-right): - Identifies influential observations - Points outside dashed lines (Cook’s distance) are influential - Be cautious if influential points are also outliers
For our model, these diagnostics look pretty good! Minor deviations are okay - models are robust to small violations.
1.10.2 What to Do When Assumptions Are Violated
If assumptions are violated, you have several options:
Transform the response variable:
Log transformation for right-skewed data
Square root transformation for count data
But transformations make interpretation harder!
Use a different model:
If data are counts: Poisson or negative binomial regression (Chapter 4-5)
If data are proportions: Beta regression (Chapter 7)
If data are binary: Logistic regression (Chapter 3)
Use robust methods:
Robust regression is less sensitive to outliers
Bootstrapping doesn’t require normality
Collect more data:
Sometimes the best solution!
Larger samples are more robust to violations
This is why we need GLMs!
Many ecological data violate linear model assumptions: - Count data (number of individuals) can’t be negative and are often right-skewed - Binary data (presence/absence) definitely aren’t normal - Proportions (percent cover) are bounded between 0 and 1
Generalized Linear Models extend the linear model framework to handle these data types properly without awkward transformations. That’s what we’ll learn in Chapter 2!
1.11 Model Comparison: Which Model is Best?
We’ve fit several models to the penguin data. How do we choose between them?
# Compare our models using AICAIC(model1, model2, model3, model4)
Warning in AIC.default(model1, model2, model3, model4): models are not all
fitted to the same number of observations
AIC balances model fit against model complexity: - Lower AIC is better - Penalizes complex models (more parameters) - Difference of 2+ AIC units is considered meaningful
From our comparison: - Model 4 (interaction) has lowest AIC → best balance of fit and complexity - Model 1 (flipper length only) has highest AIC → missing important predictors - Model 3 (additive) is close to Model 4 → could be a reasonable simpler alternative
We’ll cover model selection in much more detail in Chapter 8. For now, just know that AIC is one tool for comparing models.
1.12 Key Takeaways from Chapter 1
Before we move on to GLMs, let’s summarize the key concepts:
Statistical Foundations: - Probability distributions describe patterns of variation - We estimate population parameters from sample statistics, always with uncertainty - The Central Limit Theorem explains why the normal distribution is so important - Standard errors and confidence intervals quantify uncertainty - P-values don’t tell us what we think they do - effect sizes and CIs are more informative
Linear Models: - Linear models express response as a linear combination of predictors - Can handle continuous and categorical predictors - Interactions allow effects to depend on context - Check assumptions using diagnostic plots - Compare models using criteria like AIC
Looking Ahead: - Many ecological data violate linear model assumptions - GLMs extend linear models to different data types - The framework is the same - we just add a link function and choose an appropriate distribution!
1.13 Practice Exercises
Use the penguin dataset to answer these questions. Try to work through them yourself before looking at hints or solutions!
Exercise 1: Exploring bill dimensions
Question: Is there a relationship between bill length and bill depth? Does this relationship differ among species?
Tasks: a) Create a scatter plot of bill length vs. bill depth, colored by species b) Fit a linear model with bill depth as the response and bill length as the predictor c) Fit an interaction model including both bill length and species d) Which model is better according to AIC? e) Interpret the results biologically
# Your code here!# Remember to filter out NA values first
For the plot, use geom_point() and geom_smooth(). For models, use formulas like bill_depth_mm ~ bill_length_mm and bill_depth_mm ~ bill_length_mm * species.
Exercise 2: Sex differences across species
Question: Do male and female penguins differ in body mass? Is this difference consistent across all three species?
Tasks: a) Calculate mean body mass by sex and species b) Create a visualization showing body mass distributions by sex for each species c) Fit an interaction model to test whether sex differences vary among species d) Interpret the coefficients: is the sex difference larger in some species?
# Your code here!
Hint for Exercise 2
For part (c), your model formula should be: body_mass_g ~ sex * species
Remember that interaction coefficients tell you how much the effect differs from the reference group!
Exercise 3: Predicting body mass
Question: Build the best model you can to predict penguin body mass.
Tasks: a) Consider multiple predictors: flipper length, bill length, bill depth, sex, species b) Fit several candidate models (at least 3 different models) c) Compare models using AIC d) Check diagnostic plots for your best model e) Report the R² and interpret what it means
# Your code here!
Hint for Exercise 3
Think about which predictors are likely to be important biologically. Start with simple models and add complexity.
Some candidates to consider: - body_mass_g ~ flipper_length_mm - body_mass_g ~ flipper_length_mm + sex - body_mass_g ~ flipper_length_mm + sex + species - body_mass_g ~ flipper_length_mm * sex + species
Don’t forget to remove NAs from all variables you’ll use!
Exercise 4: Testing assumptions
Question: Choose one of your models from Exercise 3 and thoroughly check its assumptions.
Tasks: a) Create all four diagnostic plots b) For each plot, explain what you’re looking for and what you observe c) Are there any concerning violations? d) If you found violations, what would you do about them?
# Your code here!
Exercise 5: Confidence intervals and interpretation
Question: Let’s practice proper interpretation of statistical results.
Using the interaction model body_mass_g ~ flipper_length_mm * species:
Tasks: a) Extract and report the confidence intervals for all coefficients b) Write out the full prediction equation for Gentoo penguins c) What is the predicted body mass for a Gentoo penguin with a flipper length of 220 mm? d) Calculate the 95% confidence interval for this prediction e) Write a brief summary (2-3 sentences) of what this model tells us about penguin body mass, focusing on effect sizes rather than p-values
# Get confidence intervalsconfint(model4)# For prediction, you'll need to create a new data frame:new_penguin <-data.frame(flipper_length_mm =220,species ="Gentoo")# Then use predict() with interval = "confidence"
Practice writing about statistics
Part (e) is crucial! Practice explaining results in plain language: - Lead with effect sizes and their confidence intervals - Describe what the relationships mean biologically - Don’t just report “significant” or “not significant”
1.14 Additional Resources
Textbooks
Gelman, A., & Hill, J. (2007).Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Excellent practical guide with great insights on interpretation
Quinn, G. P., & Keough, M. J. (2002).Experimental Design and Data Analysis for Biologists. Cambridge University Press.
Written specifically for biologists, covers linear models thoroughly
Zuur, A. F., Ieno, E. N., & Smith, G. M. (2007).Analysing Ecological Data. Springer.
Ecological examples throughout, practical focus
Articles
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003.
The essential critique of p-values and NHST
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews, 82(4), 591-605.
Why and how to report effect sizes
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.
The American Statistical Association’s guidelines on p-value interpretation
Online Resources
Statistics for Ecologists - https://naturalandenvironmentalscience.shinyapps.io/EnvironmentalStatistics/
Interactive tutorials for ecological statistics
Data Analysis in R - https://bookdown.org/steve_midway/DAR/
Free online book focused on ecological applications
Ready for Chapter 2? Now that we have our foundations solid, we’re prepared to extend this framework to handle the diverse data types we encounter in ecology. In the next chapter, we’ll see how Generalized Linear Models build on everything we’ve learned here!