Common Data Cleaning Mistakes... and How to Prevent Them

Recently, I’ve been thinking a lot about mistakes. Mistakes in how we store, handle and clean data can be extremely costly, and they are also incredibly easy to make. What’s more, data mistakes are often semi-invisible; they don’t result in a website failing to load, or code not running. Instead, we still obtain some data, and an analysis or model is still produced - it’s just the wrong data, an inaccurate model, or the wrong conclusion.

So how do we identify and prevent errors before they cost us? Compulsory code review can certainly help us to avoid mistakes. But relying solely on code review to detect errors is a risky strategy that places a heavy burden on the reviewer.

That’s why, as much as possible, it’s good to write tests and assertions that systematically check for common data mistakes.

Some things are undoubtedly easier to test than others; it’s easier to check that some analysis code does what you think it does than to test that you’ve chosen the correct statistical tests (although you could write tests to check that the assumptions of a statistical test are met).
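As a tiny example of the latter: before running a two-sample t-test, you could check that its assumptions aren’t obviously violated. The vectors below are purely hypothetical stand-ins for two groups being compared, and the thresholds are just illustrative.

# hypothetical samples standing in for two groups about to be compared with t.test()
set.seed(42)
group_a <- rnorm(30, mean = 10)
group_b <- rnorm(30, mean = 12)

# rough checks of the usual t-test assumptions before trusting the test itself
shapiro.test(group_a)$p.value > 0.05        # roughly normal?
shapiro.test(group_b)$p.value > 0.05
var.test(group_a, group_b)$p.value > 0.05   # similar variances?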

On the other hand, data cleaning is a part of the analysis process that I believe lends itself more naturally to tests.

It’s not easy to anticipate every potential source of error. However, there are at least some mistakes that I believe crop up commonly, that I’ve definitely made myself, and that I’ve watched many others make:

  • Row Explosion During Joins
  • Accidental Filtering of Null/NAs
  • Case-When Categorisation Mistakes
  • WHERE Statements in SQL That Aren’t What They Seem

None of these mistakes is unavoidable - so as well as explaining each mistake with a worked example, I’m going to go through the tests I use to identify them in my own code.

To explain what I mean by each of these mistakes, and show how they can be prevented, I’m going to use two small, fake data sets. First, the users table, a table of imaginary users who signed up to an imaginary website on either a trial or paying plan.

users %>% 
  head(5)
## # A tibble: 5 x 4
##   user_id signup_date   age plan                 
##     <dbl> <chr>       <dbl> <chr>                
## 1       1 01/01/2020     21 Trial                
## 2       2 01/01/2020     45 Trial (extended)     
## 3       3 03/01/2020     33 Premium Plus         
## 4       4 04/01/2020     19 Premium Plus (Weekly)
## 5       5 04/01/2020     NA Premium

Second, payments, a table of purchases made by our imaginary users.

payments %>% 
  head(5)
## # A tibble: 5 x 3
##   user_id payment_date amount
##     <dbl> <chr>         <dbl>
## 1       3 04/01/2020    200. 
## 2       4 01/08/2020     34.7
## 3       5 09/01/2020    160. 
## 4       7 13/01/2020     69.8
## 5       9 13/01/2020     69.8
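(If you’d like to follow along, here’s a rough sketch of how similar toy tibbles could be built. It only recreates the five rows printed above for each table, whereas the actual tables used in this post are larger - 20 users and a few more payments - so row counts later on won’t match exactly.)

library(tidyverse)

# partial recreation of the users table shown above
users <- tribble(
  ~user_id, ~signup_date, ~age, ~plan,
  1, "01/01/2020", 21, "Trial",
  2, "01/01/2020", 45, "Trial (extended)",
  3, "03/01/2020", 33, "Premium Plus",
  4, "04/01/2020", 19, "Premium Plus (Weekly)",
  5, "04/01/2020", NA, "Premium"
)

# partial recreation of the payments table shown above
payments <- tribble(
  ~user_id, ~payment_date, ~amount,
  3, "04/01/2020", 200,
  4, "01/08/2020", 34.7,
  5, "09/01/2020", 160,
  7, "13/01/2020", 69.8,
  9, "13/01/2020", 69.8
)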

Without further ado, let’s look at these mistakes.

Mistake 1: Row Explosion During Joins

Example Problem

Suppose I want to join the payments table to the users table, to find out which users have made purchases, and how much each user has spent.

The user_id column exists in both tables, so I use this column to perform a left join of the payments onto the users table:

users %>% 
  left_join(payments, by = "user_id") %>% 
  head(5)
## # A tibble: 5 x 6
##   user_id signup_date   age plan                  payment_date amount
##     <dbl> <chr>       <dbl> <chr>                 <chr>         <dbl>
## 1       1 01/01/2020     21 Trial                 <NA>           NA  
## 2       2 01/01/2020     45 Trial (extended)      <NA>           NA  
## 3       3 03/01/2020     33 Premium Plus          04/01/2020    200. 
## 4       4 04/01/2020     19 Premium Plus (Weekly) 01/08/2020     34.7
## 5       4 04/01/2020     19 Premium Plus (Weekly) 15/01/2020     34.7

At first glance, everything looks fine. But, if we compare the row count before and after the join, we can see that the row count has actually increased after the left join:

# row count before join
users %>% 
  nrow()
## [1] 20
# row count after join
users %>% 
  left_join(payments, by = "user_id") %>% 
  nrow()
## [1] 24

So, what’s happened?

At first glance, the user_id field in payments may appear to be a primary key. But actually, a user can make more than one payment, and so appear multiple times in the table.

When we left join the payments table to the users table, the rows for these users are duplicated in the result to account for the multiple payments:

users %>% 
  filter(user_id %in% c(4, 10)) %>% 
  left_join(payments, by = "user_id") %>% 
  head(5)
## # A tibble: 5 x 6
##   user_id signup_date   age plan                  payment_date amount
##     <dbl> <chr>       <dbl> <chr>                 <chr>         <dbl>
## 1       4 04/01/2020     19 Premium Plus (Weekly) 01/08/2020     34.7
## 2       4 04/01/2020     19 Premium Plus (Weekly) 15/01/2020     34.7
## 3       4 04/01/2020     19 Premium Plus (Weekly) 21/01/2020     34.7
## 4      10 13/01/2020     51 Basic (Weekly)        15/01/2020     12.0
## 5      10 13/01/2020     51 Basic (Weekly)        21/01/2020     12.0

Usually this mistake is made because there is a column that looks like it should be a primary key, but that turns out not to be. Admittedly, in the example I’ve given, assuming that user_id is a primary key in the payments table might seem obviously misguided. However, I promise you, there will be situations that are less clear-cut - and I have definitely seen this mistake made many times.
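A quick way to sanity-check that assumption before joining is to count the rows per supposed key - anything returned by the sketch below means the column isn’t actually unique:

# any rows returned here mean user_id appears more than once in payments,
# so joining on it can multiply rows
payments %>% 
  count(user_id) %>% 
  filter(n > 1)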

Error Detection

There are two tests that can be useful to detect and avoid this mistake.

One check that we can do is to look at the row count before and after joining the tables, to ensure that the row count has changed (or not changed) as expected.

For instance, in my example, I’m expecting the row count to remain the same after joining the payments table. I’d say this is a common situation when performing a left join.

This means that I could identify my mistake by checking that the number of rows in the users table is the same before and after the join.

Although I could write a function in my code to test that the number of rows hasn’t changed, there are already some great packages that enable easy testing. Here, I’m using assertthat, but you could also look at either assertr or testthat. Likewise, there are similar libraries in Python (e.g. pytest).

Note that, so this post’s code keeps running past failing checks, I’m using the see_if() function from the package, which returns a logical value with the failure message as an attribute, rather than stopping execution.

In actual production code though, use assert_that(), which will return an error and stop the code from running any further.

library(assertthat)

user_summary <- users %>% 
  left_join(payments, by = "user_id")

see_if(are_equal(nrow(users), nrow(user_summary)))
## [1] FALSE
## attr(,"msg")
## [1] "nrow(users) not equal to nrow(user_summary)"
# in production code, replace see_if() with assert_that():
# assert_that(nrow(users) == nrow(user_summary))

(In my code, assertthat is simply testing the condition that I’ve stipulated, to help me detect my mistake.)

Another check that’s useful in this context is to test whether the primary key is still unique after performing the join.

For instance, in my example, the user_id column in users contains only unique values - and a test would tell me that this is no longer the case after performing my join.

see_if(are_equal(length(user_summary$user_id), length(unique(user_summary$user_id))))
## [1] FALSE
## attr(,"msg")
## [1] "length(user_summary$user_id) not equal to length(unique(user_summary$user_id))"

Mistake 2: Accidental Filtering of Nulls/NAs

Example Problem

Suppose I want to filter out all of the users on a ‘Premium Plus’ plan from my users table. All I need to do is to filter for any plan that’s not equal to ‘Premium Plus’, right?

users %>% 
  filter(plan != "Premium Plus")
## # A tibble: 17 x 4
##    user_id signup_date   age plan                 
##      <dbl> <chr>       <dbl> <chr>                
##  1       1 01/01/2020     21 Trial                
##  2       2 01/01/2020     45 Trial (extended)     
##  3       4 04/01/2020     19 Premium Plus (Weekly)
##  4       5 04/01/2020     NA Premium              
##  5       7 08/01/2020     21 Basic                
##  6       8 12/01/2020     25 Trial                
##  7       9 13/01/2020     36 Basic                
##  8      10 13/01/2020     51 Basic (Weekly)       
##  9      11 13/01/2020     32 Basic Plus           
## 10      13 14/01/2020     33 Basic Plus           
## 11      14 15/01/2020     24 Trial (extended)     
## 12      15 16/01/2020     29 Basic                
## 13      16 16/01/2020     22 Premium              
## 14      17 20/01/2020     41 Basic Plus           
## 15      18 21/01/2020     18 Premium              
## 16      19 22/01/2020     20 Trial                
## 17      20 22/01/2020     NA Basic

Looking at this table, I have indeed filtered out the Premium Plus users…

But the table only had 1 Premium Plus user. And when I count the number of rows left in my table, I’ve gone from 20 rows to 17 rows!

users %>% 
  filter(plan != "Premium Plus") %>% 
  nrow()
## [1] 17

As it turns out, some users in the table have an NA (null) entry for plan. And, while filtering out the Premium Plus users, I’ve filtered out all of these users too!

This happens because R doesn’t know what to do with these cases: if the value is NA, the condition can’t be evaluated to TRUE or FALSE, so filter() simply drops the row. (FYI this isn’t just an ‘R’ thing - SQL’s WHERE clause treats NULL comparisons in the same way.)
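You can see this for yourself at the console - the comparison returns NA rather than TRUE or FALSE, and filter() only keeps rows where the condition is TRUE:

NA != "Premium Plus"
## [1] NA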

You might say, ‘but this shouldn’t happen! The plan column should never be NULL!’.

Sure, it probably shouldn’t be. But it is. Welcome to data in the real world!

Error Detection

Here again, we can test that this error has not occurred by checking the row count before and after the filter.

First, we need to find out how many rows match the condition we want to filter out (in this case, the number of users with a Premium Plus plan).

n_premium_plus <- users %>% 
  filter(plan == "Premium Plus") %>% 
  nrow()

We then need to check that we’ve only removed this number of rows after filtering.

filtered_users <- users %>% 
  filter(plan != "Premium Plus")

see_if(are_equal(nrow(filtered_users), nrow(users) - n_premium_plus))
## [1] FALSE
## attr(,"msg")
## [1] "nrow(filtered_users) not equal to nrow(users) - n_premium_plus"

And that’s not all! We can also write a test to ensure that there are no missing values in the plan column to begin with:

see_if(noNA(users$plan))
## [1] FALSE
## attr(,"msg")
## [1] "users$plan contains 2 missing values"

(Here, noNA() checks that the column contains no missing values; the failure message reports how many it found.)
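Beyond detecting the mistake, you can also make the intended treatment of NAs explicit in the filter itself. Either of the sketches below keeps the users with a missing plan:

# keep the NA plans explicitly
users %>% 
  filter(plan != "Premium Plus" | is.na(plan))

# %in% treats NA as 'not a match', so the NA rows are kept
users %>% 
  filter(!plan %in% "Premium Plus")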

Mistake 3: Case-When Categorisation Mistakes

Example Problem

This problem occurs when you write a series of CASE… WHEN statements to categorise column entries, but the order of your conditions means that one of them is never reached.

For instance, in my users table there are a number of different plans, some of which seem like variations on the same plan (e.g. ‘Premium’ and ‘Premium (Weekly)’). I want to re-categorise the plans into a smaller set of categories, so I write a case_when() statement:

users %>% 
  mutate(plan_type = case_when(
    str_detect(plan, "Basic") ~ "Basic",
    str_detect(plan, "Premium") ~ "Premium",
    str_detect(plan, "Basic Plus") ~ "Basic Plus",
    str_detect(plan, "Premium Plus") ~ "Premium Plus",
    str_detect(plan, "Trial") | is.na(plan) ~ "Trial"
  ))
## # A tibble: 20 x 5
##    user_id signup_date   age plan                  plan_type
##      <dbl> <chr>       <dbl> <chr>                 <chr>    
##  1       1 01/01/2020     21 Trial                 Trial    
##  2       2 01/01/2020     45 Trial (extended)      Trial    
##  3       3 03/01/2020     33 Premium Plus          Premium  
##  4       4 04/01/2020     19 Premium Plus (Weekly) Premium  
##  5       5 04/01/2020     NA Premium               Premium  
##  6       6 07/01/2020     18 <NA>                  Trial    
##  7       7 08/01/2020     21 Basic                 Basic    
##  8       8 12/01/2020     25 Trial                 Trial    
##  9       9 13/01/2020     36 Basic                 Basic    
## 10      10 13/01/2020     51 Basic (Weekly)        Basic    
## 11      11 13/01/2020     32 Basic Plus            Basic    
## 12      12 14/01/2020     27 <NA>                  Trial    
## 13      13 14/01/2020     33 Basic Plus            Basic    
## 14      14 15/01/2020     24 Trial (extended)      Trial    
## 15      15 16/01/2020     29 Basic                 Basic    
## 16      16 16/01/2020     22 Premium               Premium  
## 17      17 20/01/2020     41 Basic Plus            Basic    
## 18      18 21/01/2020     18 Premium               Premium  
## 19      19 22/01/2020     20 Trial                 Trial    
## 20      20 22/01/2020     NA Basic                 Basic

This may not look immediately problematic, but let’s see what happens when I count up the number of rows in each plan_type:

users %>% 
  mutate(plan_type = case_when(
    str_detect(plan, "Basic") ~ "Basic",
    str_detect(plan, "Premium") ~ "Premium",
    str_detect(plan, "Basic Plus") ~ "Basic Plus",
    str_detect(plan, "Premium Plus") ~ "Premium Plus",
    str_detect(plan, "Trial") | is.na(plan) ~ "Trial"
  )) %>% 
  count(plan_type)
## # A tibble: 3 x 2
##   plan_type     n
##   <chr>     <int>
## 1 Basic         8
## 2 Premium       5
## 3 Trial         7

We can now see that there are no ‘Basic Plus’ or ‘Premium Plus’ plans - all of those plans have been lumped into the ‘Basic’ or ‘Premium’ categories.

The ‘Basic Plus’ plans all meet the first condition, since they contain the word ‘Basic’. So that condition evaluates to TRUE, and these plans are all binned into the ‘Basic’ plan type; the same thing happens to the ‘Premium Plus’ plans, which match the ‘Premium’ condition.

In a case_when() statement, conditions are evaluated in order: as soon as one is met for a given row, that row is assigned the corresponding value and any later conditions are ignored.
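One way to avoid the problem altogether is to list the most specific conditions first, so they get a chance to match before the broader ones. Re-ordering the conditions above gives a count for every category:

users %>% 
  mutate(plan_type = case_when(
    str_detect(plan, "Basic Plus") ~ "Basic Plus",
    str_detect(plan, "Premium Plus") ~ "Premium Plus",
    str_detect(plan, "Basic") ~ "Basic",
    str_detect(plan, "Premium") ~ "Premium",
    str_detect(plan, "Trial") | is.na(plan) ~ "Trial"
  )) %>% 
  count(plan_type)
## # A tibble: 5 x 2
##   plan_type        n
##   <chr>        <int>
## 1 Basic            5
## 2 Basic Plus       3
## 3 Premium          3
## 4 Premium Plus     2
## 5 Trial            7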

Error Detection

A good starting point to identify this type of categorisation error is simply to count up the rows per category after the case-when is applied. If any categories are unexpectedly returning zero, this should be an indication that the categorisation has not been done as expected.

We could test this by asserting that none of the category counts are equal to zero.
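A minimal sketch of that check, assuming recategorised_users is the result of the case_when() above and expected_plan_types lists every category we expect to see at least once (both names are just illustrative):

expected_plan_types <- c("Basic", "Basic Plus", "Premium", "Premium Plus", "Trial")

observed_plan_types <- recategorised_users %>% 
  count(plan_type) %>% 
  pull(plan_type)

# fails if any expected category never appears (i.e. has a count of zero)
see_if(all(expected_plan_types %in% observed_plan_types))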

However, I appreciate that this isn’t a foolproof check; the count for a category could be greater than zero whilst still being incorrect, and sometimes we may have categories that should have a count of zero.

Mistake 4: The SQL Filter That Doesn’t Do What You Think

This is a different class of filtering error that I’ve come across when analysts work in SQL.

For instance, suppose I write the following SQL statement to re-categorise plans, and filter out trial plans:

SELECT user_id,
       signup_date,
       age,
       CASE
         WHEN plan = 'Basic Plus' THEN 'Basic Plus'
         WHEN plan = 'Premium Plus' THEN 'Premium Plus'
         WHEN plan LIKE '%Basic%' THEN 'Basic'
         WHEN plan LIKE '%Premium%' THEN 'Premium'
         WHEN plan IS NULL OR plan LIKE '%Trial%' THEN 'Trial'
         ELSE 'Other'
       END AS plan
FROM users
WHERE plan != 'Trial'

At first glance, this may not look problematic.

However, in SQL the WHERE clause is logically processed before the SELECT clause.

What this means is that the filter is applied to the original plan values, not the re-categorised ones. So plans like ‘Trial (extended)’, which the CASE statement maps to ‘Trial’, slip through the filter and are erroneously returned by my query (and the null plans are silently dropped by the WHERE clause, for the reasons covered in Mistake 2).

Error Detection

Sadly, there’s no handy assertthat package for SQL. To prevent this error altogether in SQL code, it’s useful to avoid giving a transformed or re-categorised column the same name as the original column (e.g. if you’re not feeling particularly inventive, call the new plans column something like recategorised_plans). If you then filter on the new name, most SQL servers will return an error - because WHERE is processed before SELECT, the alias doesn’t exist yet - which makes the mistake immediately obvious.

Sometimes, particularly if you’re writing ETL/ELT code, you may still want to implement tests as a more foolproof way of avoiding this error. In that case, I suggest using dbt (data build tool), an analytics engineering tool which, among many other great features, enables you to write tests on your SQL code.

For instance, in this example, I could include tests with dbt to ensure that there are no null or empty plans in the data returned by the query.
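If the results of the query end up back in R anyway, the same assertthat checks work on the returned data frame. A minimal sketch, assuming a DBI connection called con and the query above stored in a string called plan_query (both hypothetical names):

plan_data <- DBI::dbGetQuery(con, plan_query)

# the query is meant to exclude trial plans, and nulls should have been re-categorised
see_if(noNA(plan_data$plan))
see_if(!any(plan_data$plan == "Trial", na.rm = TRUE))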

Other Checks and Assertions

The errors I’ve described above are the ones I’ve noticed crop up the most, and the ones I’ve found most pernicious and annoying to identify. However, it’s a totally subjective list of my personal bugbears and by no means an exhaustive list of potential data cleaning errors.

Below is a list of other useful tests/assertions you might want to run on your data (a sketch of how these could look follows the list):

  • Assert that an ID column contains only unique values
  • Assert that there are no duplicate rows in the table
  • Assert that a field has no NAs/NULLs
  • Assert that values fall within a specified range (e.g. there are no purchase amounts < 0)
  • Assert that a field has the expected type
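
For what it’s worth, here’s roughly how those checks might look using the same assertthat approach as above, applied to the toy tables from this post:

see_if(are_equal(length(users$user_id), length(unique(users$user_id))))  # ID column is unique
see_if(are_equal(nrow(users), nrow(distinct(users))))                    # no duplicate rows
see_if(noNA(users$plan))                                                 # no NAs/NULLs
see_if(all(payments$amount >= 0, na.rm = TRUE))                          # values within range
see_if(is.numeric(payments$amount))                                      # expected type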

I’d love to hear thoughts about other errors that people have come across, and/or tests that they implement in their data cleaning code.

I’d also love to hear how people are implementing tests in their analysis code, so hit me up if you have any thoughts, big or small!