Day 35

Math 216: Statistical Thinking

Bastola

Correlation & Simple Linear Regression

Key Question: How do two quantitative variables relate to each other? Let’s explore correlation and regression analysis!

Data Structure: Each case \(i\) has two measurements
- \(x_i\): explanatory variable (predictor)
- \(y_i\): response variable (outcome)
Real-World Examples:
- Temperature vs. cricket chirp rate
- Height vs. weight
- Study time vs. exam scores

Visualizing Relationships

Visualizing Relationships with Scatterplots:

A scatterplot shows the relationship between \((x_i, y_i)\) pairs, helping us understand:

Form: Is the relationship linear or non-linear?
Direction: Positive (upward slope), negative (downward slope), or no association?
Strength: How closely do points cluster around the trend?

Example: Associations in Car dataset

Various Associations of quantitative variables in Cars data
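
The Cars data behind the figure is not reproduced in these notes, so the minimal sketch below uses R's built-in `mtcars` data as a stand-in (an assumption, not necessarily the dataset shown on the slide). It draws one scatterplot with a negative association and one with a positive association.

```r
# Stand-in data: built-in mtcars (assumed substitute for the slide's Cars data)
cars_df <- mtcars[, c("wt", "mpg", "hp")]

# Negative association: heavier cars tend to get fewer miles per gallon
plot(mpg ~ wt, data = cars_df, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Negative association")

# Positive association: heavier cars tend to have more horsepower
plot(hp ~ wt, data = cars_df, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Horsepower",
     main = "Positive association")
```

Reading each plot for form, direction, and strength before computing any statistic is the habit the checklist above is meant to build.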

Understanding Association Direction

Key Question: When one variable changes, what happens to the other?

Positive Association: As \(x\) increases, \(y\) increases
- Age of husband and age of wife
- Height and diameter of a tree
- Study time and exam scores
Negative Association: As \(x\) increases, \(y\) decreases
- Number of cigarettes smoked and lung capacity
- Depth of tire tread and miles driven
- Temperature and heating costs

Intuition: In a positive relationship the two variables move in the same direction; in a negative relationship they move in opposite directions!

Understanding Correlation: The \(r\) and \(\rho\) Story

Key Question: How do we measure the strength and direction of linear relationships? Meet correlation coefficients!

Correlation Coefficients: \(r\) (sample) or \(\rho\) (population) quantify linear relationships
Strength Scale: \(r \approx \pm 1\) (strong), \(r \approx 0\) (weak)
Direction: Positive (\(r > 0\)) or negative (\(r < 0\)) linear association

The Correlation Formula: \[ r = \frac{\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)}{n-1} \]

Interpretation Guide:

\(r = 1\): Perfect positive linear relationship
\(r = -1\): Perfect negative linear relationship
\(r = 0\): No linear relationship (a non-linear relationship may still be present)

# R code to calculate correlation
cor(data$x, data$y)  # Simple but powerful!
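
To connect the formula to `cor()`, here is a small sketch that computes \(r\) as the sum of products of z-scores divided by \(n - 1\). It uses the built-in `mtcars` data purely as an illustrative stand-in; the variable names are assumptions.

```r
# Illustrative data: weight and fuel economy from built-in mtcars (assumed stand-in)
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Standardize each variable using the sample mean and sample standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Correlation formula: sum of z-score products divided by n - 1
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in calculation for comparison
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)  # the two values should agree
```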

Car Correlations

Correlations of various variables in Cars data

Correlation & Regression Decision Framework

flowchart LR
    %% Styling definitions
    classDef start fill:#FFFACD,stroke:#FF8C00,stroke-width:2px,color:#000
    classDef decision fill:#E6F3FF,stroke:#1E88E5,stroke-width:2px,color:#000
    classDef action fill:#E8F5E9,stroke:#43A047,stroke-width:2px,color:#000
    classDef endStyle fill:#FFEBEE,stroke:#E53935,stroke-width:2px,color:#000

    %% Nodes
    %% Added <br/> to A and D to save horizontal space
    A(["Start: Two<br/>Quantitative Variables"]):::start
    B{"Create<br/>Scatterplot"}:::decision
    C{"Linear<br/>Pattern?"}:::decision
    D{"Calculate<br/>Correlation r"}:::decision
    E{"Strong?<br/>|r| > 0.7"}:::decision
    F["Fit Linear<br/>Regression<br/>ŷ = b₀ + b₁x"]:::action
    G["Consider<br/>Non-linear<br/>Models"]:::action
    H["Weak Rel.<br/>No Linear<br/>Model"]:::action
    I["Interpret<br/>Slope &<br/>Intercept"]:::endStyle

    %% Flow connections
    A --> B
    B --> C
    C -->|Yes| D
    C -->|No| G
    D --> E
    E -->|Yes| F
    E -->|No| H
    F --> I
    G --> I
    H --> I

Figure 1: Regression Analysis Flowchart
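
The flowchart can be sketched as a small helper function. The function name `suggest_next_step` and its arguments are hypothetical; the "linear pattern" decision still has to come from looking at the scatterplot, so it is passed in as a logical value, and the 0.7 cutoff is the rule of thumb from the flowchart rather than a universal standard.

```r
# Hypothetical helper mirroring the flowchart; not part of any package
suggest_next_step <- function(x, y, looks_linear, cutoff = 0.7) {
  # Decision 1: the visual check of the scatterplot is supplied by the analyst
  if (!looks_linear) {
    return("Consider non-linear models")
  }
  # Decision 2: compare |r| to the rule-of-thumb cutoff
  r <- cor(x, y)
  if (abs(r) > cutoff) {
    "Fit linear regression"
  } else {
    "Weak relationship: no linear model"
  }
}

# Example with built-in mtcars (assumed stand-in data)
step_wt_mpg <- suggest_next_step(mtcars$wt, mtcars$mpg, looks_linear = TRUE)
step_wt_mpg
```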

Deterministic vs. Probabilistic Models

Key Question: Are relationships in the real world perfectly predictable, or do we need to account for randomness?

Deterministic Model: Perfect, predictable relationships without error
- Example: \(y = 1.5x\) (always exactly 1.5 times x)
- Reality Check: Rarely exists in real-world data!
Probabilistic Model: Realistic relationships with randomness
- Example: \(y = 1.5x + \text{random error}\)
- General Form: \(y = \text{Deterministic component} + \text{Random error}\)
- Key Insight: Mean of random error is 0, so \(E(y)\) matches the deterministic component
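
A short simulation can make the distinction concrete. This is a sketch under assumed settings: the seed, the range of \(x\), and the normal errors with standard deviation 1 are illustrative choices, not values from the slides.

```r
set.seed(216)  # arbitrary seed so the simulation is reproducible

x <- seq(1, 10, length.out = 200)

# Deterministic model: y is always exactly 1.5 times x
y_det <- 1.5 * x

# Probabilistic model: same deterministic component plus random error
# (normal errors with mean 0 and sd 1 are an assumption for illustration)
eps <- rnorm(length(x), mean = 0, sd = 1)
y_prob <- y_det + eps

# The average error should be close to 0, so E(y) tracks 1.5 * x
mean(eps)

plot(x, y_prob, pch = 19, col = "grey40", xlab = "x", ylab = "y")
lines(x, y_det, col = "red", lwd = 2)  # deterministic component
```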

Linear Regression Model

Key Question: How do we find the best straight line to describe our data? Welcome to linear regression!

Regression Equation: \(\hat{y} = b_0 + b_1x\)
- \(x\): explanatory variable (predictor)
- \(\hat{y}\): predicted response variable
Coefficients (sample estimates):
- Slope (\(b_1\)): How much \(\hat{y}\) changes for each one-unit increase in \(x\) \[ b_1 = \frac{\text{change in } \hat{y}}{\text{change in } x} \]
- Intercept (\(b_0\)): Predicted \(y\) when \(x = 0\) \[ \hat{y} = b_0 + b_1(0) = b_0 \]

Simple Linear Regression: Fitting and Evaluation

Key Question: How do we actually find the “best” regression line?

Three-Step Process:
1. Hypothesize: Assume the deterministic component (e.g., \(E(y) = \beta_0 + \beta_1x\))
2. Estimate: Use least squares to find the best-fitting line
3. Evaluate: Assess model fit and use for prediction
Residuals: The signed vertical distance from each point to the regression line, \(e_i = y_i - \hat{y}_i\)
- Geometric Meaning: How far each point is from our “best guess” line
- Statistical Purpose: Help us measure how well our model fits the data
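
The "Estimate" step can be sketched by hand using the standard least squares formulas \(b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\). The example uses the cricket data tabulated in the Data Overview; the object names are illustrative.

```r
# Cricket data from the Data Overview table
temperature <- c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76)
chirp_rate  <- c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)

# Least squares slope and intercept
x_dev <- temperature - mean(temperature)
y_dev <- chirp_rate - mean(chirp_rate)
b1 <- sum(x_dev * y_dev) / sum(x_dev^2)
b0 <- mean(chirp_rate) - b1 * mean(temperature)

# Fitted values and residuals (observed minus predicted)
fitted_vals <- b0 + b1 * temperature
resid_vals  <- chirp_rate - fitted_vals

c(intercept = b0, slope = b1)
sum(resid_vals)    # least squares residuals sum to (numerically) zero
sum(resid_vals^2)  # the quantity least squares minimizes
```

These hand-computed coefficients should agree with what `lm()` reports for the same data.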

Residuals

Implementing Simple Linear Regression

Real-World Example: Can we predict cricket chirp rate from temperature?

Research Question: Does temperature affect how fast crickets chirp?
Model Approach: Use linear regression to predict chirp rate from temperature
Practical Application: This could help estimate temperature by listening to crickets!

Methodology:

Collect temperature and chirp rate data
Fit a linear model using least squares
Assess how well temperature predicts chirp rate
Use the model to predict chirp rate at a given temperature

Data Overview

Observation	Temperature (°F)	Chirp Rate (chirps/15 sec)
1	89	20
2	72	16
3	93	20
4	84	18
5	81	17
6	75	16
7	70	15
8	82	17
9	69	15
10	83	16
11	80	15
12	83	17
13	81	16
14	84	17
15	76	14

Step-by-Step: Cricket Chirp Analysis with lm()

Let’s analyze the cricket chirp data step by step using R’s lm() function!

# Create cricket chirp dataset
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)

# Display first few rows
head(cricket_data)

  temperature chirp_rate
1          89         20
2          72         16
3          93         20
4          84         18
5          81         17
6          75         16

Step 1: Visualize the Relationship

Step 1: Create Scatterplot

First, let’s visualize the relationship between temperature and chirp rate with a scatterplot.

Observation

We can see a clear positive linear relationship: as temperature increases, chirp rate also tends to increase!
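
The plotting code is not shown in these notes; one way to draw the scatterplot with base R graphics (an assumption, since the original figure may have used a different plotting package) is:

```r
# Cricket data, repeated here so the snippet runs on its own
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)

# Scatterplot: explanatory variable on the x-axis, response on the y-axis
plot(chirp_rate ~ temperature, data = cricket_data, pch = 19,
     xlab = "Temperature (°F)",
     ylab = "Chirp rate (chirps/15 sec)",
     main = "Cricket chirp rate vs. temperature")
```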

Step 2: Fit Linear Regression Model

Step 2: Fit Linear Model

Now let’s fit a linear regression model using R’s lm() function.

# Fit linear regression model
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)
summary(chirp_model) # Display model summary


Call:
lm(formula = chirp_rate ~ temperature, data = cricket_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7246 -0.6013  0.2164  0.6280  1.5221 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.37210    3.23064  -0.115 0.910064    
temperature  0.21180    0.04018   5.271 0.000151 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.01 on 13 degrees of freedom
Multiple R-squared:  0.6812,    Adjusted R-squared:  0.6567 
F-statistic: 27.78 on 1 and 13 DF,  p-value: 0.0001513
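
Once the model is fitted, the coefficients and predictions can be pulled out with `coef()` and `predict()`. The sketch below refits the same model so it runs on its own; the temperatures in `new_temps` are illustrative values chosen inside the observed range (69 to 93 °F).

```r
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

# Estimated intercept (b0) and slope (b1)
coef(chirp_model)

# Predicted chirp rates at a few temperatures within the observed range
new_temps <- data.frame(temperature = c(75, 80, 85))
predicted <- predict(chirp_model, newdata = new_temps)
predicted
```

At 80 °F the prediction is about 16.6 chirps per 15 seconds, which matches \(-0.372 + 0.212 \times 80\). The intercept itself describes 0 °F, far outside the observed temperatures, so it should not be read as a realistic chirp rate.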

Step 3: Add Regression Line to Scatterplot

Step 3: Overlay Regression Line

Let’s add the regression line to our scatterplot to visualize the model fit.

Visual Assessment

The red line shows our best-fit linear model. Points are generally close to the line, indicating a good fit!
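
A base R sketch of this overlay, assuming the same data and model as above (the original figure's plotting code is not shown, so the styling here is a guess):

```r
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

# Scatterplot of the raw data
plot(chirp_rate ~ temperature, data = cricket_data, pch = 19,
     xlab = "Temperature (°F)", ylab = "Chirp rate (chirps/15 sec)")

# abline() reads the intercept and slope directly from the fitted model
abline(chirp_model, col = "red", lwd = 2)
```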

Step 4: Visualize Residuals

Step 4: Examine Residuals

Now let’s visualize the residuals: the vertical distances from each point to the regression line.

Understanding Residuals

The orange lines show residuals: how far each observation is from its predicted value. Smaller residuals mean better model fit!
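
One possible way to draw the residual segments with base R (a sketch; the original figure's code is not shown) is to connect each observed point to its fitted value with `segments()`:

```r
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

# Residuals: observed chirp rate minus fitted chirp rate
res <- residuals(chirp_model)

plot(chirp_rate ~ temperature, data = cricket_data, pch = 19,
     xlab = "Temperature (°F)", ylab = "Chirp rate (chirps/15 sec)")
abline(chirp_model, col = "red", lwd = 2)

# Vertical segment from each observed point to the regression line
segments(x0 = cricket_data$temperature, y0 = cricket_data$chirp_rate,
         x1 = cricket_data$temperature, y1 = fitted(chirp_model),
         col = "orange", lwd = 2)

# Smallest and largest residuals; compare with Min and Max in summary()
round(range(res), 4)
```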

Summary: Cricket Chirp Analysis

Key Takeaways from our lm() analysis:

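Strong Positive Relationship: Temperature is a statistically significant predictor of chirp rate (slope p-value ≈ 0.00015; \(r = \sqrt{0.6812} \approx 0.83\))
Model Equation: \(\hat{y} = -0.372 + 0.212x\), where \(x\) is temperature (°F) and \(\hat{y}\) is the predicted chirp rate (chirps/15 sec); each additional 1 °F is associated with about 0.21 more chirps per 15 seconds
Good Fit: R² = 0.681 (68.1% of the variation in chirp rate is explained by temperature)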
Practical Application: Can predict chirp rate from temperature
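
In simple linear regression, R² is the square of the correlation coefficient \(r\). The sketch below checks this for the cricket data; it refits the model so that it runs on its own.

```r
cricket_data <- data.frame(
  temperature = c(89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76),
  chirp_rate = c(20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14)
)
chirp_model <- lm(chirp_rate ~ temperature, data = cricket_data)

# Sample correlation between temperature and chirp rate
r <- cor(cricket_data$temperature, cricket_data$chirp_rate)

# R-squared as reported by summary()
r_squared <- summary(chirp_model)$r.squared

round(c(r = r, r_squared = r_squared, r_times_r = r^2), 4)
```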

lm() Function Benefits:

Simple syntax: lm(y ~ x, data)
Comprehensive output with coefficients, R², and significance
Easy to extract predictions and residuals
Foundation for more complex statistical modeling
