Huanfa Chen - huanfa.chen@ucl.ac.uk
25th October 2025
Lecture 4 - linear algebra
Looked at:
The most basic statistical information about the dataset and each variable (separately)
Denote the city populations by \([y_1, y_2, \dots, y_n]\), their mean by \(\bar{y}\), and their variance by \(\sigma^2\)
\[ \begin{aligned} \sigma^2 &= \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n} \\ &= \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_n - \bar{y})^2}{n} \end{aligned} \]
Let’s take variance as an example
| Questions | Answer |
|---|---|
| Does it exist for all inputs? | Yes |
| Meaning of sign (+/-) | Non-negative; 0 means all values are identical |
| Range | [0, \(\infty\)) |
| Is it normalised (i.e. in [-1,1] or [0,1]) | No |
| Is it symmetric, if it involves multiple inputs? | N/A |
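The properties in the table can be checked with a minimal sketch of the population variance formula above (dividing by \(n\)). The function name `population_variance` and the city population figures are hypothetical and used only for illustration.

```python
def population_variance(values):
    """Population variance: mean squared deviation from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance needs at least one value")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Hypothetical city populations (in thousands), for illustration only
populations = [120, 340, 560, 90, 410]
constant = [5, 5, 5, 5]

var_pop = population_variance(populations)
var_const = population_variance(constant)  # identical values give variance 0
# Not normalised: multiplying the data by 10 multiplies the variance by 100
var_scaled = population_variance([10 * v for v in populations])

print(var_pop, var_const, var_scaled)
```

The result carries the squared units of the data, which is why the variance is not bounded above.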
Examples:
\[ \mathrm{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right) \]
| Questions | Answer |
|---|---|
| Does it exist for all inputs? | Yes |
| Meaning of sign (+/-) | Positive: x and y tend to change in the same direction; negative: in opposite directions |
| Range | (-\(\infty\), \(\infty\)) |
| Is it normalised (i.e. in [-1,1] or [0,1]) | No |
| Is it symmetric, if it involves multiple inputs? | Yes, cov(x,y)=cov(y,x) |
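As a rough illustration of the covariance formula and the table above, the sketch below computes the population covariance (dividing by \(n\)) and checks symmetry and the lack of normalisation. The function name `covariance` and the toy values are hypothetical.

```python
def covariance(xs, ys):
    """Population covariance (divides by n), following the formula above."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical toy values, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
cov_yx = covariance(y, x)  # symmetric: same value as cov_xy
# Not normalised: rescaling x by 100 rescales the covariance by 100
cov_scaled = covariance([100 * v for v in x], y)

print(cov_xy, cov_yx, cov_scaled)
```

Because the value depends on the units of both variables, the magnitude of a covariance is hard to interpret on its own.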
\[ r_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
Francis Galton
Image credit: https://commons.wikimedia.org/wiki/File:Portrait_of_Francis_Galton._Wellcome_M0002305.jpg
Karl Pearson
Image credit: https://commons.wikimedia.org/wiki/File:Karl_Pearson,_1912.jpg
| Questions | Answer |
|---|---|
| Does it exist for all inputs? | Not when variance of x or y equals 0 |
| Meaning of sign (+/-) | Positive: x and y tend to change in the same direction; negative: in opposite directions |
| Range | [-1, 1] |
| Is it normalised (i.e. in [-1,1] or [0,1]) | Yes |
| Is it symmetric, if it involves multiple inputs? | Yes, Cor(x,y)=Cor(y,x) |
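A minimal sketch of the Pearson formula, assuming the population (divide by \(n\)) form used above; the \(1/n\) factors cancel between numerator and denominator. The function name `pearson_r`, the choice to return `None` when a variance is zero, and the toy values are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns None when x or y has zero variance."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # undefined: division by a zero standard deviation
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                          # symmetric
r_scaled = pearson_r([100 * v for v in x], y)   # unchanged by rescaling x
r_undefined = pearson_r(x, [3, 3, 3, 3, 3])     # y is constant, so no value exists

print(r_xy, r_yx, r_scaled, r_undefined)
```

Unlike the covariance, the result does not change when a variable is rescaled by a positive constant, which is what normalisation buys.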
Where:
- \(t\) = t-statistic
- \(r\) = calculated Pearson correlation coefficient
- \(n\) = number of paired observations
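The symbols listed above match the standard significance test for a Pearson correlation, \( t = r \sqrt{(n-2)/(1-r^2)} \) with \(n-2\) degrees of freedom. Assuming that is the formula intended here, it can be sketched as follows. The function name `correlation_t_statistic` is hypothetical; the inputs \(r = 0.3350\) and \(n = 10\) are taken from the worked example later in these slides.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for testing H0: no linear correlation, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r = 0.3350   # Pearson correlation from the worked example in these slides
n = 10       # number of paired observations in that example
t_value = correlation_t_statistic(r, n)
degrees_of_freedom = n - 2

print(round(t_value, 4), degrees_of_freedom)
```

Turning `t_value` into a p-value requires the t-distribution with `degrees_of_freedom` degrees of freedom, which this sketch leaves out.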
Image credit: https://commons.wikimedia.org/wiki/File:Correlation_examples.png
Original Values:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| x | 0.55 | 0.72 | 0.6 | 0.54 | 0.42 | 0.65 | 0.44 | 0.89 | 0.96 | 0.38 |
| y | 0.79 | 0.53 | 0.57 | 0.93 | 0.07 | 0.09 | 0.02 | 0.83 | 0.78 | 0.87 |
Step 1: Ranks:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| rank_x | 5 | 8 | 6 | 4 | 2 | 7 | 3 | 9 | 10 | 1 |
| rank_y | 7 | 4 | 5 | 10 | 2 | 3 | 1 | 8 | 6 | 9 |
Step 2: Spearman Correlation = 0.0424
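The two steps above can be sketched in plain Python. This version assumes no tied values (true for this data) and uses the shortcut \( \rho = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \), where \(d_i\) is the difference between the two ranks of observation \(i\). The function names `ranks` and `spearman_rho` are hypothetical.

```python
def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Values from the "Original Values" table (rounded to two decimals)
x = [0.55, 0.72, 0.6, 0.54, 0.42, 0.65, 0.44, 0.89, 0.96, 0.38]
y = [0.79, 0.53, 0.57, 0.93, 0.07, 0.09, 0.02, 0.83, 0.78, 0.87]

rank_x = ranks(x)
rank_y = ranks(y)
rho = spearman_rho(x, y)

print(rank_x)         # [5, 8, 6, 4, 2, 7, 3, 9, 10, 1]
print(rank_y)         # [7, 4, 5, 10, 2, 3, 1, 8, 6, 9]
print(round(rho, 4))  # 0.0424
```

Here the squared rank differences sum to 158, so \( \rho = 1 - 948/990 \approx 0.0424 \).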
Starting from the original data, increase the largest x value by 1000
Original: Pearson correlation = 0.3350, Spearman correlation = 0.0424
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| x | 0.55 | 0.72 | 0.6 | 0.54 | 0.42 | 0.65 | 0.44 | 0.89 | 1000.96 | 0.38 |
| y | 0.79 | 0.53 | 0.57 | 0.93 | 0.07 | 0.09 | 0.02 | 0.83 | 0.78 | 0.87 |
Updated: Pearson correlation = 0.2261, Spearman correlation = 0.0424
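The comparison above can be reproduced roughly with the sketch below, which treats Spearman correlation as Pearson correlation applied to ranks. The helper names are hypothetical. The tables show values rounded to two decimals, so the Pearson figures computed here may differ slightly from those quoted above, which were presumably computed on unrounded data. The Spearman value depends only on the ranks and should match.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation (assumes both inputs have non-zero variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Rounded values from the tables above
x = [0.55, 0.72, 0.6, 0.54, 0.42, 0.65, 0.44, 0.89, 0.96, 0.38]
y = [0.79, 0.53, 0.57, 0.93, 0.07, 0.09, 0.02, 0.83, 0.78, 0.87]

# Add 1000 to the largest x value to create an extreme outlier
x_outlier = list(x)
x_outlier[x_outlier.index(max(x_outlier))] += 1000

pearson_before = pearson_r(x, y)
pearson_after = pearson_r(x_outlier, y)
spearman_before = spearman_rho(x, y)
spearman_after = spearman_rho(x_outlier, y)

print(round(pearson_before, 4), round(pearson_after, 4))
print(round(spearman_before, 4), round(spearman_after, 4))
```

The outlier keeps its rank of 10, so the ranks and therefore the Spearman correlation are unchanged, while the Pearson correlation shifts.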
| | Nominal | Ordinal | Interval | Ratio |
|---|---|---|---|---|
| Pearson correlation | ❌ | ❌ | ✅ | ✅ |
| Spearman correlation | ❌ | ✅ | ✅ | ✅ |
Image credit: https://www.geeksforgeeks.org/data-science/t-test/
Image credit: datanovia.com
| Assumption | Testing |
|---|---|
| Independence of observations | Design-based check, randomization review |
| No significant outliers | Boxplots/Z-scores; be cautious when removing outliers |
| Normality | Shapiro–Wilk test; Q–Q plots |
| Homogeneity of variances | Levene test; if violated, use Welch's test, which does not assume equal variances |
| Feature | One-Way ANOVA | Two-Way ANOVA |
|---|---|---|
| Purpose | Compare means of three or more groups based on one factor | Compare means based on two factors and their interaction |
| Example | Do school attainments vary across LA (Camden/Islington/Westminster)? | Do school attainments vary across LA (Camden/Islington/Westminster) & gender (male/female)? |
| Assumptions | Normality, no outliers, homogeneity of variances, independence | Same as one-way plus assumptions for interaction effects |
| Output | F-statistic, p-value for group differences | F-statistics and p-values for each factor and interaction effects |
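To make the one-way ANOVA output more concrete, here is a minimal sketch of how the F-statistic is formed: the between-group mean square divided by the within-group mean square. The function name `one_way_anova_f` and the attainment scores for the three local authorities are hypothetical, made up for illustration only. The p-value is omitted because it requires the F-distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical attainment scores per local authority, for illustration only
camden = [62, 65, 70, 68]
islington = [58, 60, 63, 59]
westminster = [71, 74, 69, 72]

f_stat, df_between, df_within = one_way_anova_f([camden, islington, westminster])
print(round(f_stat, 4), df_between, df_within)
```

A large F means the group means differ by more than the within-group spread would suggest. Two-way ANOVA extends the same decomposition to two factors and their interaction.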
© CASA | ucl.ac.uk/bartlett/casa