Math 216: Statistical Thinking
Key Question: How do two quantitative variables relate to each other? Let’s explore correlation and regression analysis!
Data Structure: Each case \(i\) has two measurements, \((x_i, y_i)\)
Real-World Examples:
Visualizing Relationships with Scatterplots:
A scatterplot shows the relationship between \((x_i, y_i)\) pairs, helping us understand:
Key Question: When one variable changes, what happens to the other?
Positive Association: As \(x\) increases, \(y\) increases
Negative Association: As \(x\) increases, \(y\) decreases
Intuition: In a positive relationship the variables move together; in a negative relationship they move in opposite directions!
Key Question: How do we measure the strength and direction of linear relationships? Meet correlation coefficients!
The Correlation Formula: \[ r = \frac{\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)}{n-1} \]
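As a sanity check, the formula can be computed directly in R and compared against the built-in `cor()` function. The vectors below are illustrative toy data, not course data:

```r
# Toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)

# Sum of products of z-scores, divided by n - 1 (the formula above)
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# R's built-in correlation
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)  # TRUE -- both are about 0.85
```

The agreement confirms that `cor()` is exactly this standardized-products formula.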
Interpretation Guide:
```mermaid
flowchart LR
    %% Styling definitions
    classDef start fill:#FFFACD,stroke:#FF8C00,stroke-width:2px,color:#000
    classDef decision fill:#E6F3FF,stroke:#1E88E5,stroke-width:2px,color:#000
    classDef action fill:#E8F5E9,stroke:#43A047,stroke-width:2px,color:#000
    classDef endStyle fill:#FFEBEE,stroke:#E53935,stroke-width:2px,color:#000

    %% Nodes
    %% Added <br/> to A and D to save horizontal space
    A(["Start: Two<br/>Quantitative Variables"]):::start
    B{"Create<br/>Scatterplot"}:::decision
    C{"Linear<br/>Pattern?"}:::decision
    D{"Calculate<br/>Correlation r"}:::decision
    E{"Strong?<br/>|r| > 0.7"}:::decision
    F["Fit Linear<br/>Regression<br/>ŷ = b₀ + b₁x"]:::action
    G["Consider<br/>Non-linear<br/>Models"]:::action
    H["Weak Rel.<br/>No Linear<br/>Model"]:::action
    I["Interpret<br/>Slope &<br/>Intercept"]:::endStyle

    %% Flow connections
    A --> B
    B --> C
    C -->|Yes| D
    C -->|No| G
    D --> E
    E -->|Yes| F
    E -->|No| H
    F --> I
    G --> I
    H --> I
```

Key Question: Are relationships in the real world perfectly predictable, or do we need to account for randomness?
Key Question: How do we find the best straight line to describe our data? Welcome to linear regression!

Key Question: How do we actually find the “best” regression line?
Three-Step Process:
Residuals: The vertical distance from each point to the regression line
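In symbols, the residual for case \(i\) is \(e_i = y_i - \hat{y}_i\), the observed value minus the model's prediction. A minimal sketch with toy data shows the equivalence to R's `resid()`:

```r
# Toy data for illustration
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)
fit <- lm(y ~ x)

# Residuals are observed minus fitted values
resid_manual <- y - fitted(fit)

all.equal(unname(resid_manual), unname(resid(fit)))  # TRUE

# Least squares forces the residuals to sum to (essentially) zero
sum(resid_manual)
```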

Real-World Example: Can we predict cricket chirp rate from temperature?
Methodology:
| Observation | Temperature (°F) | Chirp Rate (chirps/15 sec) |
|---|---|---|
| 1 | 89 | 20 |
| 2 | 72 | 16 |
| 3 | 93 | 20 |
| 4 | 84 | 18 |
| 5 | 81 | 17 |
| 6 | 75 | 16 |
| 7 | 70 | 15 |
| 8 | 82 | 17 |
| 9 | 69 | 15 |
| 10 | 83 | 16 |
| 11 | 80 | 15 |
| 12 | 83 | 17 |
| 13 | 81 | 16 |
| 14 | 84 | 17 |
| 15 | 76 | 14 |
Let’s analyze the cricket chirp data step by step using R’s lm() function!
```r
# Create cricket chirp dataset
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate  = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
```
```r
# Display first few rows
head(cricket_data)
```

```
  temperature chirp_rate
1          89         20
2          72         16
3          93         20
4          84         18
5          81         17
6          75         16
```
Step 1: Create Scatterplot
First, let’s visualize the relationship between temperature and chirp rate with a scatterplot.
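A base-R sketch of this scatterplot (plotting choices such as `pch = 19` are stylistic, not prescribed by the course):

```r
# Recreate the cricket data so this chunk runs on its own
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate  = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)

# Scatterplot of chirp rate against temperature
plot(chirp_rate ~ temperature, data = cricket_data,
     pch = 19,
     xlab = "Temperature (°F)",
     ylab = "Chirp rate (chirps/15 sec)",
     main = "Cricket Chirps vs Temperature")
```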
Observation
We can see a clear positive linear relationship: as temperature increases, chirp rate also increases!
Step 2: Fit Linear Model
Now let’s fit a linear regression model using R’s lm() function.
```r
# Fit linear regression model
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

# Display model summary
summary(chirp_model)
```
```
Call:
lm(formula = chirp_rate ~ temperature, data = cricket_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7246 -0.6013  0.2164  0.6280  1.5221 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.37210    3.23064  -0.115 0.910064    
temperature  0.21180    0.04018   5.271 0.000151 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.01 on 13 degrees of freedom
Multiple R-squared:  0.6812,    Adjusted R-squared:  0.6567 
F-statistic: 27.78 on 1 and 13 DF,  p-value: 0.0001513
```
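Reading the summary: the slope estimate 0.21180 says each additional degree Fahrenheit predicts about 0.21 more chirps per 15 seconds. With `predict()` the fitted model produces forecasts; the 77 °F below is an arbitrary temperature chosen for illustration:

```r
# Recreate the data and model so this chunk runs on its own
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate  = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

coef(chirp_model)  # intercept ≈ -0.372, slope ≈ 0.212

# Predicted chirp rate at 77 °F
predict(chirp_model, newdata = data.frame(temperature = 77))  # ≈ 15.9 chirps/15 sec
```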
Step 3: Overlay Regression Line
Let’s add the regression line to our scatterplot to visualize the model fit.
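A sketch of this overlay in base R, using `abline()` to draw the fitted line (line color and width are stylistic choices):

```r
# Recreate the data and model so this chunk runs on its own
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate  = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

# Scatterplot with the regression line overlaid
plot(chirp_rate ~ temperature, data = cricket_data, pch = 19,
     xlab = "Temperature (°F)", ylab = "Chirp rate (chirps/15 sec)")
abline(chirp_model, col = "red", lwd = 2)  # add the fitted line
```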
Visual Assessment
The red line shows our best-fit linear model. Points are generally close to the line, indicating a good fit!
Step 4: Examine Residuals
Now let’s visualize the residuals: the vertical distances from each point to the regression line.
Understanding Residuals
The orange lines show the residuals: how far each observation falls from our prediction. Smaller residuals mean a better model fit!
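A residual plot like this can be sketched with `segments()`, drawing one vertical line per observation from the point down (or up) to its fitted value:

```r
# Recreate the data and model so this chunk runs on its own
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate  = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

plot(chirp_rate ~ temperature, data = cricket_data, pch = 19,
     xlab = "Temperature (°F)", ylab = "Chirp rate (chirps/15 sec)")
abline(chirp_model, col = "red", lwd = 2)

# Each residual: a vertical segment from the observed point to the fitted line
segments(cricket_data$temperature, cricket_data$chirp_rate,
         cricket_data$temperature, fitted(chirp_model), col = "orange")

max(abs(resid(chirp_model)))  # largest residual magnitude
```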
Key Takeaways from our lm() analysis:
lm() Function Benefits:
`lm(y ~ x, data)`