Beatrice Taylor - beatrice.taylor@ucl.ac.uk
8th October 2025
Overview of lecture 1
When we’re working with large datasets, exploratory data analysis is the first step in the scientific method.
By the end of this lecture you should be able to:
How do we understand how likely events are to occur?
Question
What is the probability of someone at UCL being over 190cm?
How can we try to answer this?
We could try and find someone on campus who is over 190cm.
A better idea is to try to understand the distribution of heights.
Before we try to describe the data, it’s important to know where the data came from.
Ideally, we would like all the relevant data.
… in reality we normally only have some.
Hence, we sample a subset of the data. We need to choose our sample carefully - we want what happens in the sample to approximate what happens in the whole population.
In practice we might try different sampling approaches such as:
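For instance, simple random sampling can be sketched with pandas; the population here is simulated, purely for illustration:

```python
import numpy as np
import pandas as pd

# Simulated population of heights (hypothetical, for illustration only)
rng = np.random.default_rng(0)
population = pd.DataFrame({"Height_cm": rng.normal(161, 10, size=100_000)})

# Simple random sample: every row has an equal chance of being chosen
sample = population.sample(n=1000, random_state=0)
print(round(sample["Height_cm"].mean(), 2), round(population["Height_cm"].mean(), 2))
```

With a well-chosen sample, the sample mean lands close to the population mean, which is exactly the property we want.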
It’s important to understand if your dataset is unrepresentative or biased.
Systematic patterns in how we think about, and perceive, the world.
Our cognitive biases can impact:
If we’re not careful we can propagate bias to the research, and hence results.
This can lead to incorrect conclusions.
Types of dataset bias include:

- Historical bias
- Selection bias
Reflects existing, real-world inequalities.
Examples:
When the sample chosen doesn’t represent the whole population of interest
Examples:
Probably not.
Failing that… we can acknowledge our biases!
Image credit: [xkcd](https://xkcd.com/2494/)
Descriptive statistics refers to the most basic statistical information about the dataset.
Let’s look at a dataset of students’ heights.
It’s easy to print the summary statistics in Python, using pandas:
|       | Height_cm |
|-------|-----------|
| count | 1000.00   |
| mean  | 161.19    |
| std   | 9.79      |
| min   | 128.59    |
| 25%   | 154.52    |
| 50%   | 161.25    |
| 75%   | 167.48    |
| max   | 199.53    |
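A minimal sketch of how such a table is produced; the data here is simulated, since the lecture’s actual dataset isn’t included:

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the lecture's height data (not the real dataset)
rng = np.random.default_rng(42)
df = pd.DataFrame({"Height_cm": rng.normal(161, 10, size=1000)})

# One call gives count, mean, std, min, quartiles, and max
print(df["Height_cm"].describe().round(2))
```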
Sometimes we need more information.
| dataset | count (x) | mean (x) | std (x) | min (x) | 25% (x) | 50% (x) | 75% (x) | max (x) | count (y) | mean (y) | std (y) | min (y) | 25% (y) | 50% (y) | 75% (y) | max (y) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 11.0 | 9.0 | 3.316625 | 4.0 | 6.5 | 9.0 | 11.5 | 14.0 | 11.0 | 7.500909 | 2.031568 | 4.26 | 6.315 | 7.58 | 8.57 | 10.84 |
| 2 | 11.0 | 9.0 | 3.316625 | 4.0 | 6.5 | 9.0 | 11.5 | 14.0 | 11.0 | 7.500909 | 2.031657 | 3.10 | 6.695 | 8.14 | 8.95 | 9.26 |
| 3 | 11.0 | 9.0 | 3.316625 | 4.0 | 6.5 | 9.0 | 11.5 | 14.0 | 11.0 | 7.500000 | 2.030424 | 5.39 | 6.250 | 7.11 | 7.98 | 12.74 |
| 4 | 11.0 | 9.0 | 3.316625 | 8.0 | 8.0 | 8.0 | 8.0 | 19.0 | 11.0 | 7.500909 | 2.030579 | 5.25 | 6.170 | 7.04 | 8.19 | 12.50 |
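The grouped table above (Anscombe’s quartet) comes from combining `groupby` with `describe`; a sketch using the first dataset’s published values:

```python
import pandas as pd

# Anscombe's quartet, dataset I (the well-known published values;
# the other three datasets work the same way)
df = pd.DataFrame({
    "dataset": ["1"] * 11,
    "x": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
})

# Grouping before describing yields one row of summary statistics per dataset
print(df.groupby("dataset").describe())
```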
The most fundamental distribution
You’ve seen it all before
The probability density function (PDF) describes the likelihood of different outcomes for a continuous random variable.
A random variable is a way to map the outcome of a random process to a probability.
In mathematical notation, a random variable \(X\) that is approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma\) is written:

\[\begin{align} X \sim N(\mu,\sigma) \end{align}\]

The sampling distribution is the distribution of a statistic (such as the mean) derived from a random sample of size \(n\).
In the case of the normal distribution the standard deviation of the sampling distribution becomes:

\[\begin{align} \frac{\sigma}{\sqrt{n}} \end{align}\]

In practice, as the sample size increases the sampling distribution becomes more and more concentrated around the mean.
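A quick simulation illustrates this; the population parameters (\(\mu = 161\), \(\sigma = 9.8\)) and sample size (\(n = 100\)) here are assumed for illustration:

```python
import numpy as np

# Assumed population parameters and sample size (for illustration)
mu, sigma, n = 161.0, 9.8, 100
rng = np.random.default_rng(1)

# Draw 10,000 samples of size n and record each sample's mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

# The spread of the sample means approaches sigma / sqrt(n)
print(round(sample_means.std(), 3), round(sigma / np.sqrt(n), 3))
```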
We can use the PDF to evaluate the probability density at a specific point.
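For example, revisiting the opening question with `scipy.stats.norm`, assuming heights follow the summary statistics from the table above (\(\mu = 161.19\), \(\sigma = 9.79\)):

```python
from scipy.stats import norm

# Assumed parameters, taken from the summary statistics above
mu, sigma = 161.19, 9.79

# The PDF gives a density at a point, not a probability
density_at_190 = norm.pdf(190, loc=mu, scale=sigma)

# Probabilities come from areas under the PDF, i.e. the CDF
p_over_190 = 1 - norm.cdf(190, loc=mu, scale=sigma)
print(f"P(height > 190 cm) ~ {p_over_190:.4f}")
```

Under these assumptions, someone over 190 cm is a rare but not impossible observation.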
\[\begin{align} X \sim N(0,1) \end{align}\]

Many real-world datasets are approximately normally distributed.
But not all
Continuous data Measurable data which can take any value within a given range.
example: height
Discrete data Measurable data which can take separate, countable values.
example: shoe size
Having a function for the distribution allows us to evaluate the probability of events, and hence evaluate hypotheses.
For discrete distributions we have the probability mass function.
Sampling distributions
As with the normal distribution, in the general case we should be aware of the sampling distribution.
Where \(n\) is the number of trials, and \(p\) is the probability of success for each trial.
The probability mass function (PMF) describes the likelihood of different outcomes for a discrete random variable
Evaluating the PMF, we get:
\[\begin{align} P(X \geq 7) &= P(X=7) + P(X=8) + P(X=9) + P(X=10) \\ &= 0.1719 \end{align}\]

Where \(\lambda\) is the expected number of events in a given interval.
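Returning to the binomial tail probability above, it can be verified with `scipy.stats.binom`; a sketch assuming \(n = 10\) trials with success probability \(p = 0.5\) (values consistent with the 0.1719 result, as the slide doesn’t state them):

```python
from scipy.stats import binom

# Assumed parameters, consistent with the 0.1719 result above
n, p = 10, 0.5

# Sum the PMF over the outcomes of interest...
p_tail = sum(binom.pmf(k, n, p) for k in range(7, 11))

# ...or equivalently use the survival function: P(X >= 7) = 1 - P(X <= 6)
p_tail_sf = binom.sf(6, n, p)
print(round(p_tail, 4))  # 0.1719
```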
If the Poisson measures the probability of \(x\) events within a time period, then the exponential measures how long we are likely to wait between events.
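A sketch of this relationship with `scipy`, using an assumed rate of \(\lambda = 2\) events per unit time:

```python
from scipy.stats import expon, poisson

lam = 2.0  # assumed rate: 2 events per unit time, for illustration

# Poisson: probability of seeing exactly 3 events in one interval
p_three_events = poisson.pmf(3, mu=lam)

# Exponential: the waiting time between events has mean 1 / lambda
mean_wait = expon.mean(scale=1 / lam)
print(round(p_three_events, 4), mean_wait)
```

At a rate of two events per unit time, the average wait between events is half a unit, as the exponential’s mean of \(1/\lambda\) predicts.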
The greatest shortcoming of the human race is our inability to understand the exponential function – Albert Bartlett (physicist)
Image credit: https://simple.wikipedia.org/wiki/Chaturanga
. . .
10^18
. . .
Which is more than the global production of rice.
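The doubling story can be checked directly:

```python
# One grain on the first square, doubling on each of the 64 squares
total_grains = sum(2**k for k in range(64))
print(total_grains)  # 18446744073709551615, about 1.8e19
```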
The (natural) exponential function is:
\[\begin{align} y=e^x \end{align}\]Note
\(e\) here is Euler’s number, a mathematical constant.

\[\begin{align} e \approx 2.718... \end{align}\]

The PDF of the exponential distribution is:
\[\begin{align} f(x, \lambda) = \lambda e^{-\lambda x} \end{align}\]

where \(\lambda\) is the rate parameter. As in the Poisson distribution, \(\lambda\) is the fixed rate of events for a predetermined time interval.
Note
This is sometimes referred to as the negative exponential distribution, as it has a negative exponent. It is traditionally used to model the time between rare events.
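Evaluating this PDF in `scipy` requires care: `scipy.stats.expon` is parametrised by `scale = 1/λ` rather than by the rate itself. A sketch with assumed values:

```python
import math
from scipy.stats import expon

# Assumed rate and evaluation point, for illustration
lam, x = 1.5, 2.0

# Direct evaluation of f(x, lambda) = lambda * exp(-lambda * x)
by_hand = lam * math.exp(-lam * x)

# scipy parametrises the exponential by scale = 1 / lambda, not the rate
by_scipy = expon.pdf(x, scale=1 / lam)
print(math.isclose(by_hand, by_scipy))  # True
```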
The exponential allows us to answer questions like:
This is easier said than done – the best way is to invert the equation.
An inverse is the mathematical operation that reverses another operation.

Subtraction is the inverse of addition.

\[\begin{align} 2 + x = 5 \implies 5 - 2 = x \end{align}\]

Division is the inverse of multiplication.

\[\begin{align} 2 \times x = 6 \implies 6 \div 2 = x \end{align}\]

Taking the logarithm is the inverse of taking the exponential.
More generally:
\[\begin{align} a^x = b \implies \log_a(b) = x \end{align}\]

For the natural logarithm:

\[\begin{align} e^x = b \implies \log_e(b) = \ln(b) = x \end{align}\]

There are some general rules for how we apply logarithms:
\[\begin{align} \log_a(b \times c) &= \log_a(b) + \log_a(c) \\ \log_a\left(\frac{b}{c}\right) &= \log_a(b) - \log_a(c) \\ \log_a(b^c) &= c \times \log_a(b) \\ \log_a(1) &= 0 \\ \log_a(a) &= 1 \end{align}\]

Some of the most important rules:
\[\begin{align} \log_a(a^x) = x \\ \ln(e^x) = x \end{align}\]

When we have exponential data we can take the logarithm of it, and hence simplify it.
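A sketch of this with hypothetical exponential data, \(y = 3e^{0.5x}\): taking logs linearises it, so a straight-line fit recovers the rate.

```python
import numpy as np

# Hypothetical exponential data: y = 3 * exp(0.5 * x)
x = np.linspace(0, 10, 50)
y = 3 * np.exp(0.5 * x)

# ln(y) = ln(3) + 0.5 * x, so a degree-1 fit recovers slope and intercept
slope, intercept = np.polyfit(x, np.log(y), 1)
print(round(slope, 3), round(intercept, 3))
```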
We’ve covered:
The practical will focus on exploratory data analysis for a variety of different datasets.
Have questions prepared!
© CASA | ucl.ac.uk/bartlett/casa