
Auditing Uniformity - Data Wrangling with MongoDB


Showing Revision 2 created 05/24/2016 by Udacity Robot.

Okay, let's talk about our last data quality metric, that being uniformity. We're going to look at auditing a particular field for uniformity. If you remember, uniformity is about all the values in a field using the same units of measurement. Let's look at an example. Here we're going to work with the cities data set again, and what I want to explore is just one field, the latitude field, which is identified in this data set using a particular field name.
So let's take a look at some of the auditing tasks we might do here. The way I've organized this code is that we're going to loop through each of the rows in this data file, and again we're using the csv module in Python. For each row we're going to call this function, audit_float_field. This particular piece of code is something we can actually use to parse any field that should have a floating point value. In general, this is how I like to think about auditing fields in my data sets: I like to think about the things that can go wrong with a particular type of data field in general, do some auditing for that type, and then, if I need to, write more specific auditing routines to check the values. Okay, let's take a look at that audit_float_field function.
This is where all the real work happens. What I'm going to do is keep track of the number of nulls that I find, the number of empty fields, if any, and the number of field values that are actually arrays. If you remember, arrays in the infobox data set are encoded using curly braces, with vertical bars separating the individual elements. I'm also going to check to make sure that the value is actually a number, and if it is, I'm going to run a check to make sure that it falls within the minimum and maximum values, okay? So this is a way of making sure that it's using the units of measurement I expect. And if you remember, earlier we saw an example where the area for a city was actually represented in square millimeters as opposed to square kilometers.
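As an aside, the curly-brace array encoding described above can be handled with a small helper. This is a minimal sketch; the function name and exact handling here are my assumptions based on the encoding as described in this lesson, not the course's verbatim code:

```python
def parse_array(value):
    """Split an infobox-style array value like '{47.1|47.2}' into its
    elements; plain scalar values come back as a one-element list.
    (parse_array is a hypothetical helper, not from the lesson.)"""
    if value.startswith("{") and value.endswith("}"):
        return value[1:-1].split("|")
    return [value]
```

With a helper like this, each element of an array-valued field could be run through the same per-value checks as a scalar field.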
So what I'm doing in this particular piece of code is hard-coding in some values for this particular field. If I weren't using this as an example for this course, I would treat each of these as command-line parameters passed to this script; here I'm just going to hard-code them in. So if I wanted to use this for a different field, I would change the field name and change the min and max values to test for a different float field. Okay, so going back to our audit_float_field function: again, we're checking for nulls, empties, arrays, and any fields that are not in fact a number. Once we've made it through all those tests, if I get down to here, I've got something that I believe to be a number. What I'm going to do is convert it to a floating point value, because of course all the values coming in are strings, and then I'm going to check its range. Now, the range for latitude, the way this data should be encoded, is between negative 90 and positive 90, and technically speaking I should have made this check less-than-or-equal-to. Okay, so let's run this and see what pops up.
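The audit just described can be sketched roughly as follows. This is a sketch under assumptions, not the course's verbatim code: the field name wgs84_pos#lat, the "NULL" marker, and the counter names are my guesses based on the lesson's description of the data set.

```python
import csv

# Assumed, hard-coded values for the latitude field; as noted above,
# these could just as well be command-line parameters.
FIELDNAME = "wgs84_pos#lat"  # assumed name of the latitude column
MIN_VAL = -90.0
MAX_VAL = 90.0

def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def audit_float_field(value, counts):
    """Tally nulls, empties, and array-encoded values, and report any
    value that is not a number or falls outside the expected range."""
    value = value.strip()
    if value == "NULL":          # assumed null marker in this data set
        counts["nulls"] += 1
    elif value == "":
        counts["empties"] += 1
    elif value.startswith("{"):  # infobox array encoding, e.g. {a|b|c}
        counts["arrays"] += 1
    elif not is_number(value):
        print("Found non-number:", value)
    # All values come in as strings, so convert before the range check.
    elif not (MIN_VAL <= float(value) <= MAX_VAL):
        print("Found out-of-range value:", value)

def audit_file(filename):
    counts = {"nulls": 0, "empties": 0, "arrays": 0, "cities": 0}
    with open(filename) as f:
        for row in csv.DictReader(f):
            counts["cities"] += 1
            audit_float_field(row[FIELDNAME], counts)
    print(counts)
```

The per-value checks are ordered from cheapest to most specific, so a value only reaches the range check once we know it is a plain number.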
Okay, so I found three non-numbers, and you can see this looks like an okay latitude value, just expressed in a different type of unit. The total number of cities is what I expect. Quite a few nulls, actually; okay, not much we're going to do about that in this particular example. And quite a few arrays. If I wanted to fully audit this, I would need to take a look at those arrays, see what's going on there, and then check each of the individual values in them. What I'm more concerned about in this particular example are these. Now, there are several different ways of representing geographic coordinates. These are three examples where, instead of having the raw values for latitude and longitude, we've got this type of coordinate, which is actually degrees, minutes, and seconds: a different way of encoding the same latitude information. If I change this code slightly, we'll get a chance to see what the bulk of the values actually look like. Okay, and you can see that they are all values between negative 90 and positive 90, and there we can see a few negative values as well.
  78. So, commenting those out, running this again. What's going on
  79. with these values? Well, the story could be that these numbers
  80. were coded by hand using a different coordinate system, and
  81. that's actually why we're seeing this come out rather than this
  82. type of number which is what we expect. So this
  83. is the type of thing we might see when we're auditing
  84. for uniformity. We've got a single field that holds a
  85. particular type of data, in this case, latitude values for the
  86. location of cities. But there's two different coordinate
  87. systems being used here. The decimal degrees latitude and
  88. latitude represented in degrees, minutes and seconds. Now,
  89. in interest of full disclosure, I actually made the
  90. data set dirty by introducing these three values.
  91. But this is exactly the type of thing you
  92. might expect to see in terms of the
  93. same type of value being represented using different units.