Data cleaning is a term which refers to a variety of highly applied activities that are intended to ensure that the data are not, in some way, incorrect.
A “clean” dataset is one in which every variable is labeled appropriately, and organized in a way that will make sense to a user. If missing data exists, there is a unified approach to dealing with it (i.e., every missing data point is coded in the same way). The data have been checked for consistency using a variety of more or less common sense approaches. It’s important to note from the outset that data-cleaning is not a specific data analytic approach so much as an attitude about the data, combined with some specific techniques. Here are some, but only some, of those techniques.
We’ll be using a deliberately “dirtied up” dataset. This is based on the caschools
data, which contains data on all of the school districts in the state of California.
Not everyone you work with will be as well-trained as you. Large datasets that are appended over time (think institutional data) also have a tendency to suffer from drift or entropy. Many times you will receive a dataset with incomplete or incomprehensible variable labels. You will need to figure out what to do with these. The labels in this example dataset are obviously a mess.
. use ${datadir}caschool_problem, replace
. // replacing problematic variable labels
. describe
Contains data from ../data/caschool_problem.dta
obs: 420
vars: 19 19 Oct 2010 10:56
size: 60,060
------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------
observation_n~r float %9.0g Observation
dist_cod float %9.0g District Code
county str18 %18s County
district str53 %53s District
gr_span str8 %8s gr
enrl_tot float %9.0g Enrollment as of census
teachers float %9.0g teachers
calw_pct float %9.0g
meal_pct float %9.0g Meals
computer float %9.0g computer
testscr float %9.0g API INDEX
comp_stu float %9.0g comp_stu
expn_stu float %9.0g Expenditures per student
str float %9.0g ratio
avginc float %9.0g avginc
el_pct float %9.0g ELL
read_scr float %9.0g
math_scr float %9.0g
foo float %9.0g
------------------------------------------------------------------------------------------
Sorted by:
. label variable gr_span "Grade Span"
Rename the variable labels to be appropriate and descriptive.
Each of the following describes techniques that can help you to find data points that may be unreliable or wrong. When you find such a data point, you need to rely on your judgment and the context of the problem to decide what to do next.
Look for outliers: values that are extraordinarily far from the mean or median (e.g., more than 3 s.d. away for approximately normal data).
Let’s take a look at student teacher ratio using the boxplot
and histogram
commands.
. graph box str, name(box_str)
. graph export ${plotdir}box_str.eps, name(box_str) replace
(note: file ../plots/box_str.eps not found)
(file ../plots/box_str.eps written in EPS format)
. // histogram str
str
. histogram str, name(hist_str)
(bin=20, start=14, width=1.6700001)
. graph export ${plotdir}hist_str.eps, name(hist_str) replace
(note: file ../plots/hist_str.eps not found)
(file ../plots/hist_str.eps written in EPS format)
str
As you can see, there are two values of student teacher ratio that are big outliers. What should we do about these?
There’s another variable that has an extreme outlier. Find this variable and decide what to do about it.
Look for values that are atypical given the context of the student. For instance, be vary of students in rural schools who report living in a town of 2 million.
You should also look for impossible values. These include: negative values for things that must be positive like income or height; test scores that are above the maximum; proportions that are negative or above one; or percentages that are negative or above 100.
Here’s a summary of the calw_pct
variable, which is expressed in percentage terms. It clearly has at least one impossible value.
. sum calw_pct
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
calw_pct | 420 13.46038 12.25822 0 102.7
There’s another variable that has at least one impossible value. Find it.
When looking at data that are in panel format, very sharp changes from the previous period may be suspect. For instance, a student who goes from the 5th percentile to the 95th percentile in test scores.
Check to make sure that the variable is in the order that you would expect in the comparison. Are there high income students that are coded as low SES? Are there students with low GPAs and high test scores? These may be correct, but you need to check for strange patterns.
To test this, let’s plot several of the variables against one another and look for problematic relationships. Here’s a plot of average income, avginc
, against the percent of students in the district on free or reduced price lunches, meal_pct
:
. graph twoway scatter avginc meal_pct, name(sc_inc_meal)
. graph export ${plotdir}sc_inc_meal.eps, name(sc_inc_meal) replace
(note: file ../plots/sc_inc_meal.eps not found)
(file ../plots/sc_inc_meal.eps written in EPS format)
There’s another implausible value based on the relationship between two variables. Find it. (Hint: what is the biggest budget item in any school district?)
Check that there aren’t logically impossible combinations of variables. For example, we should be suspicious when a parent’s age is less than that of the student. Sometimes your dataset documentation will alert you to potential problems. The codebook of the National Longitudinal Survey of Youth, 1997, helpfully explains that “researchers should note that [work hour] totals above 168 hours per week are suspect.”
Check to make sure that calculations have been done correctly. For example this dataset has several ratio measures. Let’s recalculate these and see if they are correct. In the code below we recalculate the student teacher ratio and then plot it against the original calculation.
. // create new student teacher ratio variable
. gen str_two = enrl_tot / teachers
. // twoway scatter of both student teacher ratio variables
. graph twoway scatter str_two str, name(sc_str_str_two)
. graph export ${plotdir}sc_str_str_two.eps, name(sc_str_str_two) replace
(note: file ../plots/sc_str_str_two.eps not found)
(file ../plots/sc_str_str_two.eps written in EPS format)
That’s not good. Not good at all. teacher
has a lot of the same -4 values. Maybe that means something. Clearly a school can’t have -4 teachers. Let’s recompute the ratio but only for schools with a positive number of teachers.
. replace teachers = . if teachers == -4
(67 real changes made, 67 to missing)
. // create another new student teacher ratio variable
. gen str_three = enrl_tot / teachers
(67 missing values generated)
. // twoway scatter of new new and old student teacher ratio variables
. graph twoway scatter str_three str, name(sc_str_str_three)
. graph export ${plotdir}sc_str_str_three.eps, name(sc_str_str_three) replace
(note: file ../plots/sc_str_str_three.eps not found)
(file ../plots/sc_str_str_three.eps written in EPS format)
That’s a little better, but we still clearly have some problems that we would need to deal with before a full analysis.
There are other variables with problems in their calculations. Find them.
You’ll also need to look for duplicates in variables that shouldn’t have any. The most obvious place to look is in id numbers or the equivalent. You should also check any other variable that ought to be unique. Luckily, Stata has a whole suite of duplicate commands to work with, including, the appropriately named duplicates
.
Below is an example of a variable with no duplicates, followed by a variable with a couple of duplicates.
. // check for duplicate observations
. duplicates report observation_number
Duplicates in terms of observation_number
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 420 0
--------------------------------------
. // check for duplicate district cod
. duplicates report dist_cod
Duplicates in terms of dist_cod
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 418 0
2 | 2 1
--------------------------------------
First, figure out how missing data is coded from the codebook. Code those values as missing. That’s a dot, .
, in Stata.
Next, look for impossible values as above. These should also be recoded to missing.
Now, look for zeros. Are these really zeros? What was the criteria for having a zero? Check if these should really be missing.
Your best friend here is the Stata command inspect
. Here’s the result of the inspect
command for two of the variables:
. // inspect test scores
. inspect testscr
testscr: API INDEX Number of Observations
------------------- ---------------------------------------
Total Integers Nonintegers
| # Negative - - -
| # Zero - - -
| # # Positive 420 17 403
| # # # ----------- ----------- -----------
| # # # Total 420 17 403
| # # # # . Missing -
+---------------------- -----------
605.55 706.75 420
(More than 99 unique values)
. // inspect reading scores
. inspect read_scr
read_scr: Number of Observations
----------- ---------------------------------------
Total Integers Nonintegers
| # Negative 77 77 -
| # Zero 15 15 -
| # Positive 328 33 295
| # ----------- ----------- -----------
| # Total 420 125 295
| # . . . # Missing -
+---------------------- -----------
-1 704 420
(More than 99 unique values)
The variable read_scr
has a number of negative values. This warrants futher inspection.
. histogram read_scr, name(hist_read_scr)
(bin=20, start=-1, width=35.25)
. graph export ${plotdir}hist_read_scr.eps, name(hist_read_scr) replace
>
(note: file ../plots/hist_read_scr.eps not found)
(file ../plots/hist_read_scr.eps written in EPS format)
Look for other variables with either negative or missing values and figure out what do with them. Once you’ve got the data properly coded and missing entered as missing, use the
mvpatterns
command to figure out if there are systematic problems with your data.
All of the above is only to get you started. There is no cookbook way to approach data cleaning. The idea is to get to know the data well enough that you know whether anomalies are just strange, but true, or problems in the data themselves.
Init: 25 August 2015; Updated: 25 August 2015