-
Title:
Auditing Uniformity - Data Wrangling with MongoDB
-
Description:
-
Okay let's talk about our last data quality metric that
-
being uniformity. And we're going to look at auditing a
-
particular field for uniformity. So if you remember, uniformity is
-
about all the values in the field using the same
-
units of measurement. Let's look at an example. So here
-
we're going to work with the cities data set again. And
-
what I want to explore here is just one field,
-
that being the latitude field. Now the latitude field is identified
-
in this data set using this particular field name.
-
So let's take a look at some of the auditing
-
tasks we might do here. Now the way that I've
-
organized this code is that we're going to loop through
-
each of the rows in this data file, and again
-
here we're using our csv module in Python. For each
-
row we're going to call this function audit_float_field. So this
-
particular piece of code is something that we can actually
-
use to parse any field that should have a floating
-
point value. In general this is how I like to
-
think about auditing fields in my data sets. I like
-
to think about the things in general that can go
-
wrong with a particular type of data field. And do
-
some auditing for that type and then if I need
-
to, I can write more specific auditing routines to check
-
the values. Okay, let's take a look at that audit_float_field function.
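A minimal sketch of that row loop, assuming the data is a CSV file read with `csv.DictReader`. The names `audit_rows` and `lat`, and the tiny in-memory sample, are hypothetical stand-ins for illustration and are not the course's actual code or field name.

```python
import csv
import io


def audit_rows(lines, field, audit_fn):
    """Call audit_fn on the given field of every row and return the row count.

    `lines` is any iterable of CSV text lines (an open file works too).
    `audit_fn` is whatever per-value audit routine we want to apply.
    """
    reader = csv.DictReader(lines)
    count = 0
    for row in reader:
        audit_fn(row[field])  # every value arrives as a string
        count += 1
    return count


# Hypothetical two-row sample standing in for the cities data file.
sample = io.StringIO("name,lat\nA,12.5\nB,NULL\n")
seen = []
total = audit_rows(sample, "lat", seen.append)
```

Passing the audit routine in as a parameter keeps the loop reusable for any field type, which matches the idea of writing one general audit per type of data.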
-
This is where all of the real work happens here.
-
So, what I'm going to do is I'm going to keep
-
track of the number of nulls that I find, the
-
number of empty fields, if any, and then the number of
-
field values that are actually arrays. And if you remember
-
arrays are encoded using curly braces and vertical bars to
-
separate the individual elements of arrays in the infobox
-
data set. I'm also going to check to make sure that
-
the value is actually a number. And then if it is, I'm going to run a check to
-
make sure that, it falls within the minimum and
-
maximum values, okay? So, this is a way of making
-
sure that it's using the units of measurement that
-
I expect. And if you remember earlier, we saw
-
an example where the area for a city is
-
actually represented using square millimeters as opposed to square kilometers.
-
So, what I'm doing in this particular piece of code, is
-
actually hard coding in some values for this particular field. Now what
-
I would do here, if I wasn't using this as an example
-
for this course, is I would actually treat each of these as
-
command-line parameters that I would input to this script. Here I'm
-
just going to hard code them in. So, if I wanted to
-
actually use this for a different field what I would do is
-
change the field name and change the min and max values to
-
test for a different float field. Okay. So, going back
-
to our audit_float_field function, again, we're checking for nulls, empties,
-
arrays, and any fields that are not in fact a number.
-
Once we've made it through all these tests, then finally,
-
if I get down to here, I've got something that
-
I believe to be a number. What I'm going to do is
-
actually convert it to a floating point value, because of
-
course, all the values coming in are strings, and then I'm
-
going to check its range. Okay, now the range for latitude, the
-
way this data should be encoded is between negative 90 and positive
-
90 and technically speaking I should have made this less than or
-
equal to. Okay. So let's run this, and see what pops up.
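The checks described above could look roughly like this. This is a sketch under stated assumptions, not the code from the video: it assumes nulls appear as the string "NULL", arrays start with a curly brace, and the counters live in a plain dictionary. The names `new_counts`, `is_number`, and the dictionary keys are hypothetical. The range test uses less than or equal to, as suggested above.

```python
def new_counts():
    """Fresh tally structure (hypothetical layout)."""
    return {"nulls": 0, "empties": 0, "arrays": 0,
            "non_numbers": [], "out_of_range": []}


def is_number(s):
    """True if the string can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def audit_float_field(value, counts, min_val=-90.0, max_val=90.0):
    """Classify one raw string value from a field that should hold a float."""
    if value == "NULL":                  # assumed null marker
        counts["nulls"] += 1
    elif value == "":
        counts["empties"] += 1
    elif value.startswith("{"):          # arrays look like {a|b|c}
        counts["arrays"] += 1
    elif not is_number(value):
        counts["non_numbers"].append(value)
    else:
        v = float(value)                 # incoming values are strings
        if not (min_val <= v <= max_val):
            counts["out_of_range"].append(v)


# Made-up values that exercise each branch.
counts = new_counts()
for raw in ["NULL", "", "{1.0|2.0}", "41 53 24 N", "45.5", "-90", "123.4"]:
    audit_float_field(raw, counts)
```

Auditing a different float field would only require changing the field name and the `min_val` and `max_val` arguments.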
-
Okay? So I found three non-numbers. And you can see this looks
-
like an okay latitude value, just expressed in a different type of unit.
-
Total number of cities, that's what I expect. Quite a few nulls, actually.
-
Okay, not much we're going to do about that in this particular example. And
-
quite a few arrays. If I wanted to fully audit this, I would
-
need to take a look at these arrays and see what's going on
-
there. And then I would need to check each of the individual values
-
in those arrays. What I'm more concerned about, in this particular example, are
-
these. Now there are several different ways of representing
-
geographic coordinates. These are three examples where instead of
-
having the raw values for latitude and longitude, we've
-
instead got this type of coordinate, which is actually
-
degrees, minutes and seconds. So a different way of
-
encoding the same information for latitude. If I change
-
this code slightly we'll get a chance to see
-
what the bulk of the values actually look like.
-
Okay, and you can see that they are all values between negative 90
-
and positive 90 and there we can see a few negative values as well.
-
So, commenting those out, running this again. What's going on
-
with these values? Well, the story could be that these numbers
-
were coded by hand using a different coordinate system, and
-
that's actually why we're seeing this come out rather than this
-
type of number which is what we expect. So this
-
is the type of thing we might see when we're auditing
-
for uniformity. We've got a single field that holds a
-
particular type of data, in this case, latitude values for the
-
location of cities. But there are two different coordinate
-
systems being used here. The decimal degrees latitude and
-
latitude represented in degrees, minutes and seconds. Now,
-
in the interest of full disclosure, I actually made the
-
data set dirty by introducing these three values.
-
But this is exactly the type of thing you
-
might expect to see in terms of the
-
same type of value being represented using different units.
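To make the relationship between the two representations concrete, here is a small sketch of converting degrees, minutes and seconds into decimal degrees. It relies only on the standard relation that a degree has 60 minutes and a minute has 60 seconds. The function name `dms_to_decimal` and the hemisphere letter argument are assumptions for illustration, and parsing a particular DMS string format is left out.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere="N"):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South and west hemispheres are returned as negative values.
    """
    decimal = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    if hemisphere in ("S", "W"):
        decimal = -decimal
    return decimal


north = dms_to_decimal(40, 30, 0, "N")   # 40 degrees 30 minutes north
south = dms_to_decimal(33, 45, 0, "S")   # 33 degrees 45 minutes south
```

Once converted, such values can go through the same range check as the rest of the latitude field.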