Practical 5: Measuring relationships

This week is focused on measuring the relationship between two variables using Pearson/Spearman correlation coefficients.

In the practical, we’re going to explore the correlation between different variables in the school dataset, and then use ANOVA to test for differences between groups.

import numpy as np
import pandas as pd

from scipy import stats
from scipy.stats import pearsonr, spearmanr

import statsmodels.api as sm
from statsmodels.formula.api import ols

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

Dataset

We’re going to use a filtered school dataset kindly provided by Adam. In W6, Adam will describe the source and processing of this dataset in detail. For now, we just use this dataset as is.

# read in the dataset from Github
df_school = pd.read_csv("https://raw.githubusercontent.com/huanfachen/QM/refs/heads/main/sessions/L6_data/Performancetables_130242/2022-2023/england_filtered.csv")

Check the shape and columns of this data frame.

print(f"df_school has {df_school.shape[0]} rows and {df_school.shape[1]} columns.")
print("First 5 rows:\n")
print(df_school.head(5))
df_school has 3056 rows and 37 columns.
First 5 rows:

      URN                                  SCHNAME.x  LEA  LANAME  TOWN.x  \
0  100053                     Acland Burghley School  202  Camden  London   
1  100054                The Camden School for Girls  202  Camden  London   
2  100052                           Hampstead School  202  Camden  London   
3  100049                          Haverstock School  202  Camden  London   
4  100059  La Sainte Union Catholic Secondary School  202  Camden  London   

  gor_name  TOTPUPS  ATT8SCR  ATT8SCRENG  ATT8SCRMAT  ATT8SCR_FSM6CLA1A  \
0   London   1163.0     50.3        10.7        10.2               34.8   
1   London   1047.0     65.8        13.5        12.7               54.7   
2   London   1319.0     44.6         9.7         9.1               39.3   
3   London    982.0     41.7         8.7         8.8               37.7   
4   London    817.0     49.6        10.8         9.4               45.9   

   ATT8SCR_NFSM6CLA1A  ATT8SCR_BOYS  ATT8SCR_GIRLS  P8MEA  P8MEA_FSM6CLA1A  \
0                59.2          51.5           46.8  -0.16            -0.99   
1                72.0           NaN           65.8   0.77             0.25   
2                49.0          41.9           47.1  -0.03            -0.18   
3                48.5          40.1           43.7  -0.28            -0.44   
4                52.2           NaN           49.6   0.06            -0.20   

   P8MEA_NFSM6CLA1A  PTFSM6CLA1A  PTNOTFSM6CLA1A  PNUMEAL  PNUMENGFL  \
0              0.34         37.0            63.0     23.6       70.5   
1              1.13         36.0            64.0     25.5       73.7   
2              0.09         45.0            55.0     38.1       61.9   
3              0.02         63.0            37.0     57.5       41.9   
4              0.25         41.0            59.0     50.6       46.0   

   PTPRIORLO  PTPRIORHI   NORB   NORG  PNUMFSMEVER  PERCTOT  PPERSABS10  \
0       15.0       33.0  765.0  398.0         39.6      8.1        24.7   
1        5.0       50.0  139.0  908.0         30.3      4.5         6.6   
2       21.0       18.0  681.0  638.0         51.3      8.2        24.0   
3       30.0       15.0  559.0  423.0         69.8     10.1        33.1   
4       16.0       28.0   40.0  777.0         42.7     10.3        33.8   

             SCHOOLTYPE.x         RELCHAR       ADMPOL.y      ADMPOL_PT  \
0  State-funded secondary  Does not apply  Non-selective  OTHER NON SEL   
1  State-funded secondary             NaN  Non-selective  OTHER NON SEL   
2  State-funded secondary  Does not apply  Non-selective  OTHER NON SEL   
3  State-funded secondary  Does not apply  Non-selective  OTHER NON SEL   
4  State-funded secondary  Roman Catholic  Non-selective  OTHER NON SEL   

  gender_name OFSTEDRATING         MINORGROUP   easting  northing  
0       Mixed         Good  Maintained school  528962.0  185931.0  
1       Girls         Good  Maintained school  529441.0  184659.0  
2       Mixed         Good  Maintained school  524402.0  185633.0  
3       Mixed         Good  Maintained school  528159.0  184498.0  
4       Girls         Good  Maintained school  528379.0  186191.0  

The metadata of this dataset is available here. For convenience, the columns are described below:
1. URN: Unique Reference Number identifying a school.
2. SCHNAME.x: School name as recorded in the official register.
3. LEA: Local Education Authority (code).
4. LANAME: Name of the Local Authority.
5. TOWN.x: Town in which the school is located.
6. gor_name: Government Office Region name.
7. TOTPUPS: Total number of pupils on roll.
8. ATT8SCR: Average Attainment 8 score for all pupils.
9. ATT8SCRENG: Average Attainment 8 score for English subject grouping.
10. ATT8SCRMAT: Average Attainment 8 score for Maths subject grouping.
11. ATT8SCR_FSM6CLA1A: Average Attainment 8 score for pupils eligible for Free School Meals in the last 6 years and/or Children Looked After.
12. ATT8SCR_NFSM6CLA1A: Average Attainment 8 score for pupils not eligible for Free School Meals in the last 6 years and not Children Looked After.
13. ATT8SCR_BOYS: Average Attainment 8 score for male pupils.
14. ATT8SCR_GIRLS: Average Attainment 8 score for female pupils.
15. P8MEA: Progress 8 measure for all pupils.
16. P8MEA_FSM6CLA1A: Progress 8 measure for disadvantaged pupils (FSM6 and/or CLA1A).
17. P8MEA_NFSM6CLA1A: Progress 8 measure for non-disadvantaged pupils.
18. PTFSM6CLA1A: Percentage of pupils eligible for FSM6 and/or CLA1A.
19. PTNOTFSM6CLA1A: Percentage of pupils not eligible for FSM6 and/or CLA1A.
20. PNUMEAL: Percentage of pupils whose first language is known or believed to be other than English.
21. PNUMENGFL: Percentage of pupils whose first language is English.
22. PTPRIORLO: Percentage of pupils with low prior attainment from Key Stage 2.
23. PTPRIORHI: Percentage of pupils with high prior attainment from Key Stage 2.
24. NORB: Number of boys on roll.
25. NORG: Number of girls on roll.
26. PNUMFSMEVER: Percentage of pupils who have been eligible for free school meals in the past six years (FSM6 measure).
27. PERCTOT: Percentage of total pupil absence (authorised and unauthorised combined).
28. PPERSABS10: Percentage of pupils who are persistently absent (overall absence rate 10% or more).
29. SCHOOLTYPE.x: Official classification of the school type (e.g., Academy, Community, Voluntary Aided).
30. RELCHAR: Religious character of the school (e.g., Church of England, Roman Catholic, None).
31. ADMPOL.y: Admission policy type (e.g., non-selective, selective).
32. ADMPOL_PT: Percentage breakdown related to admission policy (context-specific).
33. gender_name: Gender intake of the school (Mixed, Boys, Girls).
34. OFSTEDRATING: Latest Ofsted inspection overall effectiveness grade.
35. MINORGROUP: Ethnic minority group classification for aggregation purposes.
36. easting: Ordnance Survey Easting coordinate of school location.
37. northing: Ordnance Survey Northing coordinate of school location.
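
Before selecting columns, it is worth a quick sanity check that pandas has parsed the numeric columns as numbers, and a look at how many values are missing (several columns contain NaNs, which will matter for the correlation analysis later). A minimal check using only the dataframe loaded above:

# Data type inferred for each column
print(df_school.dtypes)

# Ten columns with the most missing values
print(df_school.isna().sum().sort_values(ascending=False).head(10))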

To demonstrate, we will focus on 6 columns of this dataset:

1. LANAME: local authority name
2. SCHNAME.x: school name
3. ATT8SCR: Average Attainment 8 score per pupil (0-100)
4. PERCTOT: Percentage of total absence
5. PTFSM6CLA1A: Percentage of pupils eligible for FSM6 (free school meals at any point in the last six years) and/or CLA1A (a proxy for socio-economic disadvantage). CLA1A stands for Children Looked After (for at least one day), which identifies pupils who have been in the care of a local authority.
6. PNUMEAL: Percentage of pupils whose first language is known or believed to be other than English.

# Extract the selected columns
df_school_reduced = df_school[['LANAME', 'SCHNAME.x', 'ATT8SCR', 'PERCTOT', 'PTFSM6CLA1A', 'PNUMEAL']]

# Display the first few rows
print(df_school_reduced.head())
   LANAME                                  SCHNAME.x  ATT8SCR  PERCTOT  \
0  Camden                     Acland Burghley School     50.3      8.1   
1  Camden                The Camden School for Girls     65.8      4.5   
2  Camden                           Hampstead School     44.6      8.2   
3  Camden                          Haverstock School     41.7     10.1   
4  Camden  La Sainte Union Catholic Secondary School     49.6     10.3   

   PTFSM6CLA1A  PNUMEAL  
0         37.0     23.6  
1         36.0     25.5  
2         45.0     38.1  
3         63.0     57.5  
4         41.0     50.6  

Second attempt: using Spearman correlation

Can we apply Spearman correlation to these two variables? Definitely - Spearman correlation is rank-based and applies to ordinal as well as continuous variables, so it is a valid choice here.

Following a similar procedure:

# Drop rows with NaNs in either column
mask = ~np.isnan(df_school_reduced['PERCTOT']) & ~np.isnan(df_school_reduced['ATT8SCR'])

print(f"#all records: {df_school_reduced.shape[0]}")
print(f"#records with no NaN in both PERCTOT and ATT8SCR: {len(mask[mask==True])}")

x = df_school_reduced.loc[mask, 'PERCTOT']
y = df_school_reduced.loc[mask, 'ATT8SCR']

corr_coef, p_value = spearmanr(x, y)

print(f"Spearman correlation coefficient:{corr_coef:.2f}")
print(f"Two-tailed p-value:{p_value:.2f}")
#all records: 3056
#records with no NaN in both PERCTOT and ATT8SCR: 2969
Spearman correlation coefficient:-0.77
Two-tailed p-value:0.00

The results are quite similar to the Pearson correlation:

  • A Spearman correlation coefficient of -0.77 indicates a strong negative correlation between these two variables.
  • The p-value is smaller than 0.01, indicating statistical significance.
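
For a direct comparison on exactly the same cleaned series, the Pearson coefficient can be recomputed alongside it, reusing the x and y defined above:

# Pearson correlation on the same NaN-free series, for comparison
pearson_coef, pearson_p = pearsonr(x, y)
print(f"Pearson correlation coefficient: {pearson_coef:.2f}")
print(f"Two-tailed p-value: {pearson_p:.2f}")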

Calculating Pearson correlation for all pairs of variables

The df_school_reduced dataframe contains four continuous variables, so there are 6 possible pairs of variables.

Is it possible to compute and visualise the correlation of all pairs in one line?

The solution is to create and plot a correlation matrix, using pandas.

import matplotlib.pyplot as plt

# Calculate the correlation matrix
correlation_matrix = df_school_reduced.corr(numeric_only=True)

# Plot the correlation matrix
plt.figure(figsize=(8, 6))
plt.matshow(correlation_matrix, cmap='coolwarm', fignum=False)  # fignum=False prevents new fig creation
plt.colorbar()
plt.title('Correlation Matrix', pad=20)

# Add axis tick labels
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)

plt.tight_layout()
plt.show()

The plot shows the pairwise correlation between the four continuous variables; the nominal variables LANAME and SCHNAME.x are excluded by setting numeric_only=True in corr().

This matrix (or heatmap) is symmetric, since the Pearson correlation is symmetric in its two arguments.

The red colour on the diagonal indicates a correlation of 1, i.e. perfect positive correlation of each variable with itself.

So, what does this matrix tell us about the variables’ correlation?

  • Attainment 8 score correlates negatively with % overall absence, % disadvantaged pupils, and % pupils whose first language is not English.
  • % overall absence correlates positively with % disadvantaged pupils and negatively with % pupils whose first language is not English. The latter correlation seems to contradict a study reporting that primary school pupils who use EAL are more likely to be persistently absent for 10% or more of lessons than pupils who speak English as their first language (see link). This needs to be confirmed by further investigation.
  • % disadvantaged pupils is positively associated with % pupils whose first language is not English.

Interestingly, in the code above, we didn’t deal with the NaN values. The reason is that DataFrame.corr() computes Pearson, Kendall, and Spearman correlations using pairwise complete observations; in other words, the function handles incomplete observations internally.
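
As a side note, DataFrame.corr() also accepts method='spearman' or method='kendall', and seaborn (imported above) can draw an annotated version of the matrix. A minimal sketch, reusing correlation_matrix from the code above:

# The same one-liner with a rank-based method
spearman_matrix = df_school_reduced.corr(method='spearman', numeric_only=True)
print(spearman_matrix.round(2))

# Annotated heatmap of the Pearson matrix computed above
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix (Pearson)')
plt.tight_layout()
plt.show()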

A limitation of the code above is that it doesn’t calculate the p-value for each correlation. To calculate both correlation coefficients and p values in a batch, we can use a new package called pingouin.

If you want to try out the code below, you would need to install pingouin using pip install pingouin on a terminal or !pip install pingouin in a Python notebook.

# !pip install pingouin
import pingouin as pg

# Calculate correlation matrix with p-values
corr_results = pg.pairwise_corr(df_school_reduced, columns=df_school_reduced.select_dtypes(float).columns, method='pearson')

print(corr_results)
             X            Y   method alternative     n         r  \
0      ATT8SCR      PERCTOT  pearson   two-sided  2969 -0.716090   
1      ATT8SCR  PTFSM6CLA1A  pearson   two-sided  2970 -0.586077   
2      ATT8SCR      PNUMEAL  pearson   two-sided  2970  0.074745   
3      PERCTOT  PTFSM6CLA1A  pearson   two-sided  2969  0.489657   
4      PERCTOT      PNUMEAL  pearson   two-sided  3029 -0.120967   
5  PTFSM6CLA1A      PNUMEAL  pearson   two-sided  2970  0.399959   

            CI95%          p-unc        BF10     power  
0   [-0.73, -0.7]   0.000000e+00         inf  1.000000  
1  [-0.61, -0.56]  1.537169e-273  2.454e+269  1.000000  
2    [0.04, 0.11]   4.554293e-05      93.311  0.982981  
3    [0.46, 0.52]  7.284291e-179  7.175e+174  1.000000  
4  [-0.16, -0.09]   2.399714e-11   1.097e+08  0.999999  
5    [0.37, 0.43]  1.654185e-114  4.269e+110  1.000000  

Moving to ANOVA - testing differences between local authorities

# Filter dataframe
df_filtered = df_school_reduced[df_school_reduced['LANAME'].isin(['Camden', 'Westminster', 'Islington'])]

# KDE plot for each group
plt.figure(figsize=(8, 6))
sns.kdeplot(data=df_filtered[df_filtered['LANAME'] == 'Camden']['ATT8SCR'], label='Camden', fill=True)
sns.kdeplot(data=df_filtered[df_filtered['LANAME'] == 'Westminster']['ATT8SCR'], label='Westminster', fill=True)
sns.kdeplot(data=df_filtered[df_filtered['LANAME'] == 'Islington']['ATT8SCR'], label='Islington', fill=True)
plt.title('KDE Plot of ATT8SCR by LANAME')
plt.xlabel('ATT8SCR')
plt.ylabel('Density')
plt.legend()
plt.show()

The KDE plots of ATT8SCR for Camden, Westminster, and Islington overlap considerably, which indicates similar score distributions.
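
Before the formal test, it can help to look at the group means and spreads directly. A quick summary using the filtered dataframe:

# Mean, standard deviation, and count of ATT8SCR per local authority
print(df_filtered.groupby('LANAME')['ATT8SCR'].agg(['count', 'mean', 'std']).round(2))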

We will use ANOVA to formally test if they are different.

# ---------------------
# Fit ANOVA model
# ---------------------
model = ols('ATT8SCR ~ C(LANAME)', data=df_filtered).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("ANOVA Table:\n", anova_table, "\n")

# ---------------------
# Assumption Tests
# ---------------------
# Normality test – Shapiro-Wilk
shapiro_test = stats.shapiro(model.resid)
print("Shapiro-Wilk test for normality:")
print(f"Statistic={shapiro_test.statistic:.4f}, p-value={shapiro_test.pvalue:.4f}")
if shapiro_test.pvalue > 0.05:
    print("Residuals appear normally distributed.\n")
else:
    print("Residuals may not be normally distributed.\n")

# Homogeneity of variances – Levene's test
levene_test = stats.levene(
    df_filtered[df_filtered['LANAME'] == 'Camden']['ATT8SCR'],
    df_filtered[df_filtered['LANAME'] == 'Westminster']['ATT8SCR'],
    df_filtered[df_filtered['LANAME'] == 'Islington']['ATT8SCR']
)
print("Levene’s test for homogeneity of variances:")
print(f"Statistic={levene_test.statistic:.4f}, p-value={levene_test.pvalue:.4f}")
if levene_test.pvalue > 0.05:
    print("Variances appear homogeneous.\n")
else:
    print("Variances may not be homogeneous.\n")


# Visualization
# QQ plot of residuals
# sm.qqplot creates its own figure, so we don't open an extra one with plt.figure() first
fig = sm.qqplot(model.resid, line='45', fit=True)
plt.title('QQ Plot of Residuals')
plt.show()
ANOVA Table:
                 sum_sq    df         F    PR(>F)
C(LANAME)   335.630586   2.0  3.258313  0.054611
Residual   1339.097000  26.0       NaN       NaN 

Shapiro-Wilk test for normality:
Statistic=0.9513, p-value=0.1976
Residuals appear normally distributed.

Levene’s test for homogeneity of variances:
Statistic=0.6877, p-value=0.5117
Variances appear homogeneous.
<Figure size 576x576 with 0 Axes>

  • The ANOVA tests C(LANAME), which represents the difference in mean ATT8SCR scores between the three local authorities (Camden, Westminster, Islington).
  • The F statistic = 3.26 is the ratio of the variance between the group means to the variance within the groups (a manual computation of this ratio follows this list).
  • Degrees of Freedom
    • Between groups: df = 2 (because there are 3 groups: df = k−1).
    • Within groups (residual): df = 26 (total sample size minus the number of groups: 29 − 3).
  • p-value: PR(>F) = 0.0546 means that under the null hypothesis (no difference in means), there is a 5.46% probability of obtaining an F statistic as large as or larger than 3.26. If we use a standard alpha level of 0.05, this p-value is on the margin.
  • The Shapiro-Wilk test indicates that the residuals are normally distributed.
  • The Levene test indicates the variances within each group are homogeneous.
  • The QQ plot of residuals shows that the residuals follow a normal distribution and there are no obvious outliers.
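
To see where the F statistic comes from, it can also be computed by hand from the group data. A minimal sketch, reusing df_filtered and numpy, and dropping missing ATT8SCR values (the formula API drops missing rows by default):

# Manual one-way ANOVA computation, for comparison with the table above
groups = [g['ATT8SCR'].dropna() for _, g in df_filtered.groupby('LANAME')]
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: group size times squared deviation of each group mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: squared deviations around each group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

k = len(groups)                  # number of groups
n = sum(len(g) for g in groups)  # total observations used
F_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"Manual F({k - 1}, {n - k}) = {F_manual:.2f}")  # should match the table (~3.26)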

To visualise the distribution of F-statistic and the p-value, we can use the following plot:

# Extract F statistic, p-value, and degrees of freedom from ANOVA table
F_stat = anova_table.loc['C(LANAME)', 'F']
p_value = anova_table.loc['C(LANAME)', 'PR(>F)']
df_between = anova_table.loc['C(LANAME)', 'df']  # numerator degrees of freedom
df_within = anova_table.loc['Residual', 'df']    # denominator degrees of freedom

# Create range for F distribution
x = np.linspace(0, max(6, F_stat + 2), 500)  # ensure range includes F_stat with padding
y = stats.f.pdf(x, df_between, df_within)

# Create the plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, 'k-', label=f'F-distribution (df={int(df_between)}, {int(df_within)})')

# Set y-axis minimum to 0.0
plt.ylim(bottom=0.0)

# Shade the area corresponding to the p-value (upper tail)
x_fill = np.linspace(F_stat, x.max(), 200)
y_fill = stats.f.pdf(x_fill, df_between, df_within)
plt.fill_between(x_fill, y_fill, color='red', alpha=0.4, label=f'p-value = {p_value:.4f}')

# Mark the F statistic
plt.axvline(F_stat, color='blue', linestyle='--', linewidth=2,
            label=f'F statistic = {F_stat:.2f}')

# Labels and title
plt.title('F Distribution with Computed F Statistic and p-value Region')
plt.xlabel('F value')
plt.ylabel('Density')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

We can also use pingouin to conduct the ANOVA of ATT8SCR across LANAME groups. The syntax is much more concise than with statsmodels.

# One-way ANOVA using Pingouin
anova_results = pg.anova(dv='ATT8SCR', between='LANAME', data=df_filtered, detailed=True)

print(anova_results.round(2))
   Source       SS  DF      MS     F  p-unc  np2
0  LANAME   335.63   2  167.82  3.26   0.05  0.2
1  Within  1339.10  26   51.50   NaN    NaN  NaN

The results match those from the statsmodels ANOVA.

Column meanings:

  • Source: Factor being tested (“LANAME”) or residual.
  • SS: Sum of squares. The SS of LANAME (335.63) represents the between-group variation, while the SS of Within (1339.10) represents the within-group variation.
  • DF: Degrees of freedom.
  • MS: Mean squares.
  • F: F-statistic, calculated as $F(2, 26) = \frac{MS_{\text{LANAME}}}{MS_{\text{Within}}} = \frac{167.82}{51.50} \approx 3.26$.
  • p-unc: Uncorrected p-value.
  • np2: Partial eta-squared (effect size).

You’re Done!

Congratulations on completing the practical on correlation and ANOVA. You’ve also learnt how to debug when an error occurs.

When you use Python (or R), it is common that a task can be completed by different packages. For example, Pearson correlation can be calculated using scipy, pandas, or pingouin. Choose the right tool for your purpose, and ensure you understand the underlying principles and limitations before using a package.
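
As a small illustration, the same Pearson coefficient between PERCTOT and ATT8SCR can be obtained with all three tools (reusing df_school_reduced and the packages imported earlier):

# Keep only rows where both columns are present, since scipy does not handle NaNs
pair = df_school_reduced[['PERCTOT', 'ATT8SCR']].dropna()

# scipy
r_scipy, _ = pearsonr(pair['PERCTOT'], pair['ATT8SCR'])

# pandas (handles missing values internally via pairwise complete observations)
r_pandas = df_school_reduced['PERCTOT'].corr(df_school_reduced['ATT8SCR'])

# pingouin (also returns the p-value, confidence interval, and power)
r_pingouin = pg.corr(pair['PERCTOT'], pair['ATT8SCR'])['r'].iloc[0]

print(f"scipy: {r_scipy:.3f}, pandas: {r_pandas:.3f}, pingouin: {r_pingouin:.3f}")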

It takes time to learn - remember practice makes perfect!

If you have time, please think about applying the correlation to other variables in the school data, or your own datasets.