Coming from a software testing background, I am a big supporter of testing. There are numerous testing techniques – unit, integration, testing in production, manual, automated, and so on – but unit testing is the first line of defence against defects and regressions.
Good unit tests cover a small piece of code, run quickly, and provide clear feedback to the developer. They have low resource demands and fit into both local and CI workflows. These characteristics make them ideal for running frequently and flexibly, allowing issues to be discovered and resolved early in the development cycle.
Figure 1: Development feedback cycle
I have been involved in a project that relies heavily on PySpark to create extract, transform, load (ETL) jobs that process huge amounts of data.
Figure 2: PySpark ETL Job
Once the code is written, the question remains – how do we unit test it?
I am going to take you on that journey, illustrating how I approached unit testing some of these PySpark ETL jobs and revealing the challenges and learnings along the way.
Given our deadlines, we had to strike a balance between delivery and the scope of testing. The team focussed on:
From that, we concluded that we were not interested in testing implementation details. Instead, we were interested in testing the code’s behaviour – does the unit do the right thing? This gave us focus when testing, ensuring our assertions were efficient and meaningful.
Going back to basics, a unit test typically tests a single class, method or function in isolation – so what/who defines the unit boundary?
Let’s look at some code to bring this question to life:
As I was new to PySpark, I wrote my entire job within one function – a single unit!
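The original listing is not reproduced here, but a minimal sketch of the shape it took might look like the following (the paths, column names and business logic are illustrative assumptions, not the real job):

```python
from pyspark.sql import SparkSession, functions as F


def run_etl():
    # Everything lives in a single function: session creation, extract,
    # transform and load are all tangled together.
    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Extract: read straight from an external location (hypothetical path).
    orders = spark.read.parquet("s3://raw-bucket/orders/")

    # Transform: business logic mixed in with the I/O.
    daily_totals = (
        orders.filter(F.col("status") == "COMPLETE")
        .withColumn("order_date", F.to_date("order_timestamp"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )

    # Load: write straight back out to another external location.
    daily_totals.write.mode("overwrite").parquet("s3://curated-bucket/daily_totals/")
```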
Testing this code would be almost impossible for several reasons, including high coupling, shared state and dependencies on external components.
The answer was to break the code into units: many smaller functions that can be tested independently. This led me to adopt aspects of the functional programming paradigm, such as small, pure functions with explicit inputs and outputs and no hidden side effects.
A better code structure looks like this:
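(As before, this is a minimal sketch with hypothetical function, column and path names rather than the project’s actual code. The point is the shape: I/O at the edges, pure transformations in the middle.)

```python
from pyspark.sql import DataFrame, SparkSession, functions as F


def extract_orders(spark: SparkSession, path: str) -> DataFrame:
    # I/O is isolated at the edges of the job.
    return spark.read.parquet(path)


def filter_complete_orders(orders: DataFrame) -> DataFrame:
    # Pure transformation: dataframe in, dataframe out, no side effects.
    return orders.filter(F.col("status") == "COMPLETE")


def aggregate_daily_totals(orders: DataFrame) -> DataFrame:
    # Another pure transformation that can be tested on its own.
    return (
        orders.withColumn("order_date", F.to_date("order_timestamp"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )


def load_daily_totals(daily_totals: DataFrame, path: str) -> None:
    daily_totals.write.mode("overwrite").parquet(path)


def run_etl(spark: SparkSession, source_path: str, target_path: str) -> None:
    # The job itself becomes a thin orchestration layer over small,
    # independently testable units.
    orders = extract_orders(spark, source_path)
    complete_orders = filter_complete_orders(orders)
    daily_totals = aggregate_daily_totals(complete_orders)
    load_daily_totals(daily_totals, target_path)
```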
Here, we asked questions around our test ethos: should we use real or synthetic data, should our dataframes mirror production schemas, should test data live in code or in files, and should each test define its own data or share it?
We decided on synthetic data for three reasons: we can define data that exercises the specifics of each test, it avoids the security concerns of using production data, and it keeps tests readable.
We decided against creating production-like dataframes. If the function under test only used 3 columns of a 20-column dataset, the other 17 columns just added complexity and obscured what was actually being tested.
We decided on keeping data in code rather than in files. Test data in code provided readability and relevance, whilst being curated for the specific test. For small datasets, the apparent simplicity of a file-based approach is quickly outweighed by the benefits of locally defined test data.
We found that defining data for each test was most effective. Sharing datasets added a significant amount of effort to test data management. Tests that share data often needed to include filtering logic, complicating the test.
This, in addition to adopting the “Arrange, Act, Assert” pattern, enhanced the readability of our tests and made it clear what each one does.
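As a sketch of how these decisions play out, here is what a pytest test of the hypothetical aggregate_daily_totals function from the earlier sketch might look like – synthetic data defined in code, only the columns the function needs, and the “Arrange, Act, Assert” structure:

```python
import datetime

import pytest
from pyspark.sql import SparkSession

# Hypothetical module name from the sketch above.
from etl_job import aggregate_daily_totals


@pytest.fixture(scope="session")
def spark():
    # A small local session shared across the test session to keep tests fast.
    return (
        SparkSession.builder.master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_aggregate_daily_totals_sums_amounts_per_day(spark):
    # Arrange: synthetic data with only the columns the function actually uses.
    orders = spark.createDataFrame(
        [
            ("2024-01-01 09:30:00", 10.0),
            ("2024-01-01 17:45:00", 5.0),
            ("2024-01-02 08:00:00", 2.5),
        ],
        ["order_timestamp", "amount"],
    )

    # Act
    result = aggregate_daily_totals(orders)

    # Assert: behaviour, not implementation details.
    totals = {row["order_date"]: row["total_amount"] for row in result.collect()}
    assert totals == {
        datetime.date(2024, 1, 1): 15.0,
        datetime.date(2024, 1, 2): 2.5,
    }
```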
All four of these points are closely related and centred on our testing ethos: what do we consider important in our scenario – complexity, readability, repetition, exercising code paths and functionality, completeness, and so on? The balance of these factors helped us align our thinking across all four questions.
I believe that with the above decisions we met the following properties of good unit tests:
The journey is not complete yet!
These are some of the next steps I am considering:
Now go and write some tests!
This article is also featured in Retail Insight's Medium publication.