Practical 10: Testing ML Systems with pytest
This week we focus on using pytest to test the code and system behaviour of an ML project. We will use the example project that predicts daily call-outs of the London Fire Brigade (LFB) based on weather and calendar data.
This example was introduced in the Week 4 practical. The codebase is structured into a firepredict package and a tests folder, each containing Python files. The tests are written using pytest, a popular testing framework in Python.
Please note that this practical contains many command-line instructions. You can run these commands in a terminal, or inside Jupyter notebooks using the ! magic.
Learning outcomes
In this practical, we will:
- Practise using pytest for automated testing.
- Run test coverage and interpret missing coverage.
Starting the Practical
The process for this week is similar to previous weeks: download the notebook to your DSSS folder (or wherever you keep your course materials), switch over to JupyterLab (running in Podman/Docker), and work through each section.
If you want to save the completed notebook to your GitHub repo, remember to add, commit, and push your work.
Suggestions:
- Keep software language set to English for easier debugging.
- Back up work using Git/cloud storage.
- Avoid spaces in file and column names.
Quick recap: why testing ML systems?
Testing in ML focuses on the correctness of the code and system behaviour, which is different from evaluating model performance via a train-test split of the dataset.
Common test types
- Unit tests: test one function/class in isolation.
- Integration tests: test interactions across components.
- System tests: test end-to-end workflows.
- Acceptance tests: verify requirements from a user/business perspective.
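To make the first two categories concrete, here is a small sketch using two hypothetical components (clean and total are illustrative stand-ins, not part of firepredict): a unit test exercises one function in isolation, while an integration test exercises how the components work together.

```python
# Hypothetical components, for illustration only.
def clean(values):
    """Drop None entries from a list."""
    return [v for v in values if v is not None]

def total(values):
    """Sum a list of numbers."""
    return sum(values)

def test_clean_unit():
    # Unit test: exercises one function in isolation.
    assert clean([1, None, 2]) == [1, 2]

def test_clean_then_total_integration():
    # Integration test: exercises the interaction of both components.
    assert total(clean([1, None, 2])) == 3
```

A system test would go one step further and run the whole pipeline end to end, and an acceptance test would check the result against a user-facing requirement.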
Good testing principles
- Keep tests separate from core code (a tests/ folder) to avoid polluting production code.
- Automate tests so they can run in CI/CD.
- Use the AAA pattern: Arrange → Act → Assert.
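As a minimal illustration of the AAA pattern (total_incidents is a hypothetical helper, not part of firepredict):

```python
import pandas as pd

# Hypothetical helper, for illustration only -- not part of firepredict.
def total_incidents(df: pd.DataFrame) -> int:
    """Sum the IncidentCount column."""
    return int(df["IncidentCount"].sum())

def test_total_incidents():
    # Arrange: build a small, known input
    df = pd.DataFrame({"IncidentCount": [3, 5, 2]})
    # Act: call the function under test
    result = total_incidents(df)
    # Assert: check the expected outcome
    assert result == 10
```

Keeping the three phases visually separate makes each test easy to read and easy to diagnose when it fails.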
Fetch the repository
The dataset and code for this practical are stored in the ML-Testing repository on GitHub. You can clone it using either SSH or HTTPS. Run the following commands in Jupyter, or open a terminal and run them there (without !).
!# clone with SSH
!git clone git@github.com:huanfachen/ML-Testing.git

If the above command fails, SSH is probably not configured on your machine. In that case, use HTTPS instead:

!git clone https://github.com/huanfachen/ML-Testing.git

Now print the file tree of the ML-Testing folder:

!cd ML-Testing && find . -maxdepth 3 -print

You should be able to see key folders and files such as:
- firepredict/
  - __init__.py (defines the firepredict package and loads environment variables)
  - data.py (data loading and split functions)
  - train.py (model training functions)
  - tune.py (hyperparameter tuning functions)
  - evaluate.py (model evaluation functions)
  - main.py (the script that runs the full pipeline; we will not test this file)
- tests/
  - test_data.py
  - test_train.py
  - test_tune.py
  - test_evaluate.py
What tests are included?
You can open each test file in the tests/ folder to see the tests included. For example, test_data.py contains tests for the data loading and splitting functions in data.py:
- test_load_data_returns_dataframe (load_data should return a non-empty DataFrame)
- test_load_data_has_target_column (Raw dataset must contain the target column ‘IncidentCount’)
- test_load_data_has_feature_columns (Raw dataset must contain all expected feature columns)
- test_load_data_no_duplicate_rows (Dataset should have no fully duplicated rows)
- test_preprocess_returns_tuple (preprocess should return a (DataFrame, Series) tuple)
- test_preprocess_encodes_weekday (After one-hot encoding, ‘weekday’ must not appear as a raw column)
- test_preprocess_creates_dummy_columns (One-hot encoding should produce at least one ’weekday_*’ column)
- test_preprocess_lengths_match (X and y must have the same number of rows)
- test_preprocess_no_nulls (Preprocessed features and target must be free of NaN values)
- test_preprocess_target_positive (Daily incident counts should be strictly positive integers)
- test_get_data_splits_sizes (Train + test should equal the full dataset; test should be ~20%)
- test_get_data_splits_columns_match (Train and test splits must have identical column sets)
- test_get_data_splits_reproducible (The same random_state should produce identical splits)
- test_get_data_splits_no_overlap (Train and test index sets must be disjoint)
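To see what tests like the last four look like in practice, here is a hedged sketch using a toy DataFrame and a pandas-based stand-in for get_data_splits (the real signatures in firepredict/data.py may differ):

```python
import pandas as pd

def make_toy_data(n: int = 100) -> pd.DataFrame:
    """A small stand-in for the real LFB dataset."""
    return pd.DataFrame({"temp": range(n),
                         "IncidentCount": [i % 7 + 1 for i in range(n)]})

def get_toy_splits(df: pd.DataFrame, test_frac: float = 0.2, random_state: int = 42):
    """Stand-in for get_data_splits: sample a test set, keep the rest for training."""
    test = df.sample(frac=test_frac, random_state=random_state)
    train = df.drop(test.index)
    return train, test

def test_splits_sizes_and_no_overlap():
    df = make_toy_data()
    train, test = get_toy_splits(df)
    assert len(train) + len(test) == len(df)        # splits cover the full data
    assert abs(len(test) - 0.2 * len(df)) <= 1      # test split is ~20%
    assert set(train.index).isdisjoint(test.index)  # no leakage between splits

def test_splits_reproducible():
    df = make_toy_data()
    train_a, _ = get_toy_splits(df)
    train_b, _ = get_toy_splits(df)
    pd.testing.assert_frame_equal(train_a, train_b)  # same random_state, same split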
Writing all of these tests by hand is time consuming. The good news is that there are tools that can generate test stubs for you, and LLM-based assistants such as GitHub Copilot can even generate complete tests. We used GitHub Copilot to generate these tests and then reviewed them. You should always review and edit generated tests to make sure they are correct and meaningful.
Run pytest from the project root
From the root of ML-Testing, run the following command to execute all tests in the tests/ folder.
!python3 -m pytest tests/ -v

If all tests pass, you will see a summary similar to:
============================= test session starts =============================
...
collected XX items
tests/test_data.py ....
tests/test_train.py ...
tests/test_tune.py ....
tests/test_evaluate.py ....
============================== XX passed in Ys ===============================
If not all tests pass, you will see error messages indicating which tests failed and why. You can click on the file paths in the error messages to jump to the relevant test code.
Checkpoint questions
- Which test file took the longest time?
- Which tests are unit tests, and which are closer to integration/system tests?
Run coverage with pytest-cov
Now, we will check coverage of tests on the firepredict package.
!python3 -m pytest tests/ --cov=firepredict --cov-report=term-missing -v

Here is the breakdown of each parameter:

- python3: invokes the Python 3 interpreter.
- -m pytest: the -m flag stands for "module" and tells Python to run the pytest module as a script. Using python3 -m pytest instead of plain pytest is considered best practice because it ensures you are using the version of pytest installed in your current Python environment.
- tests/: the target directory. pytest will look inside the tests/ folder and execute all the test files it finds there.
- --cov=firepredict: invokes the pytest-cov plugin. It measures code coverage specifically for the firepredict package, tracking which lines of code inside firepredict are run during the tests.
- --cov-report=term-missing: formats the coverage output. term prints the report directly to the terminal; missing adds a very useful extra column listing the exact line numbers in your code that were not executed by your tests.
- -v: stands for "verbose". Instead of printing a single dot for every passing test, pytest prints the name of every test file and test function it runs, along with its individual pass or fail status.
As said before, we want to skip testing main.py because it is just a script to run the full pipeline. To exclude it from coverage, we have included the following option in the pyproject.toml file. This configuration file is used to configure various tools in Python, including pytest and coverage.py.
# Pytest coverage configuration
[tool.coverage.run]
omit = ["firepredict/main.py"]
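coverage.py reads further options from the same pyproject.toml file. For example (a possible extension of the configuration above, not part of the repository), coverage.py 7.2+ supports exclude_also for excluding specific lines such as if __name__ == "__main__": guards, and show_missing always prints the Missing column:

```toml
[tool.coverage.run]
omit = ["firepredict/main.py"]

[tool.coverage.report]
show_missing = true
exclude_also = [
    'if __name__ == .__main__.:',
]
```

This keeps coverage settings versioned alongside the code, so everyone on the team measures coverage the same way.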
If you also want an HTML report rather than a terminal report, you can run:
!python3 -m pytest tests/ --cov=firepredict --cov-report=html --cov-report=term-missing -v

Then open the HTML report using xdg-open, or open it manually in your browser:

!xdg-open htmlcov/index.html

Interpreting missing coverage
The key part of the coverage report is the Missing column, shown in the example output below. It reports that Line 54 of tune.py is not covered by any test. That line is param_grid = DEFAULT_PARAM_GRID, which sets up the default hyperparameter grid for tuning.
----------- coverage: platform linux, python 3.9.6-final-0 -----------
Name Stmts Miss Cover Missing
-------------------------------------------------------
firepredict/__init__.py 2 0 100%
firepredict/data.py 22 0 100%
firepredict/evaluate.py 10 0 100%
firepredict/train.py 7 0 100%
firepredict/tune.py 11 1 91% 54
-------------------------------------------------------
TOTAL 52 1 98%
======================= 44 passed, 2 warnings in 11.38s ========================
To achieve 100% coverage, we need a test that covers Line 54 of tune.py. To do this, manually uncomment the test function test_default_param_grid in tests/test_tune.py (by removing the # at the beginning of Lines 110-119) and run all tests again.
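This uncovered line is a default fallback, which is a very common coverage gap: every existing test passes its own grid, so the default branch never runs. A self-contained sketch of the pattern (DEFAULT_PARAM_GRID and tune_model here are illustrative stand-ins; the real tune.py signature may differ):

```python
# Illustrative stand-ins for the names in tune.py.
DEFAULT_PARAM_GRID = {"n_estimators": [50, 100], "max_depth": [3, 5]}

def tune_model(param_grid=None):
    if param_grid is None:
        param_grid = DEFAULT_PARAM_GRID  # <- the line no test reached
    return param_grid

def test_default_param_grid():
    # Calling without an explicit grid exercises the fallback line.
    assert tune_model() == DEFAULT_PARAM_GRID

def test_custom_param_grid():
    grid = {"max_depth": [2]}
    assert tune_model(grid) == grid
```

Only the first test touches the fallback line; with just the second test, coverage for this function would be incomplete.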
In general, when a file is not fully covered (e.g. 91%), common reasons are:
- one branch of an if statement is never tested,
- edge-case inputs are not included,
- fallback/error paths are not exercised.
To improve coverage, you can:
- Pick one file with incomplete coverage.
- Identify missing lines from the coverage report.
- Add one new pytest test to cover at least one missing path.
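For example, suppose the coverage report flags the error branch of a small validation helper (validate_target is hypothetical, for illustration only): one extra test using pytest.raises covers the missing path.

```python
import pytest

# Hypothetical helper whose error branch was never tested.
def validate_target(values):
    if any(v <= 0 for v in values):
        raise ValueError("Incident counts must be strictly positive")  # missing line
    return list(values)

def test_validate_target_happy_path():
    # Existing test: only exercises the successful path.
    assert validate_target([1, 2, 3]) == [1, 2, 3]

def test_validate_target_rejects_nonpositive():
    # New test: exercises the previously uncovered error path.
    with pytest.raises(ValueError):
        validate_target([1, 0, 3])
```

After adding the second test, rerun the coverage command and check that the flagged line has disappeared from the Missing column.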
Summary
Well done! In this practical, you have practised using pytest to test the code and system behaviour of an ML project and to generate coverage reports. You have also learned how to interpret missing coverage and add new tests to improve it.
If you are working in a data science team, you are likely expected to regularly write tests for your code and to maintain good test coverage.
References
This practical is inspired by the great Made-with-ML repo (https://github.com/GokuMohandas/Made-With-ML). It contains a much more complicated codebase and more comprehensive tests. Please check it out if you are interested.