English subtitles

Sources of Dirty Data - Data Wrangling with MongoDB

Showing Revision 2 created 05/25/2016 by Udacity Robot.

  1. There are lots of sources of dirty data. Basically, anytime
  2. humans are involved, there's going to be dirty data. This
  3. is a lot like any time my kids are involved,
  4. there's going to be mud tracked through the kitchen. There
  5. are lots of ways in which we touch data we
  6. work with. Let me get some hand sanitizer and then
  7. we'll get started. So, we're going to have user entry errors.
  8. In some situations, we won't have any data coding standards,
  9. or where we do have standards, they'll be poorly
  10. applied, causing problems in the resulting data. We might
  11. have to integrate data where different schemas have been
  12. used for the same type of item. We'll have
  13. legacy data systems, where data was coded when disk
  14. and memory constraints were much more restrictive than they
  15. are now. Over time, systems evolve. Needs change, and
  16. data changes. Some of our data won't have the unique
  17. identifiers it should. Other data will be lost in transformation from one
  18. format to another. And then of course there's always programmer error. And
  19. finally, data might have been corrupted in transmission or storage by cosmic
  20. rays or other physical phenomena. So hey, one that's not our fault.
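
A few of the sources listed above leave symptoms that are easy to count. The following is a minimal sketch, not something shown in the video: it assumes records arrive as a list of Python dicts, and the function name `audit_records` and the field names `id` and `state` are hypothetical, chosen only for illustration. It counts missing identifiers, finds duplicated identifiers, and flags values that differ only by case or stray whitespace, which is one common sign of poorly applied coding standards or user entry errors.

```python
from collections import Counter


def audit_records(records, id_field="id", coded_field="state"):
    """Count a few symptoms of dirty data in a list of dict records.

    The field names are assumptions for this sketch, not part of any
    real dataset or library.
    """
    ids = [r.get(id_field) for r in records]
    # Treat None and empty string as a missing identifier.
    present = [i for i in ids if i not in (None, "")]
    id_counts = Counter(present)

    # Group the raw values of the coded field by a normalized form
    # (trimmed, lowercased) so spelling variants land in one bucket.
    variants = {}
    for r in records:
        raw = r.get(coded_field)
        if isinstance(raw, str):
            variants.setdefault(raw.strip().lower(), set()).add(raw)

    return {
        "missing_ids": len(ids) - len(present),
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "inconsistent_codes": {
            k: sorted(v) for k, v in variants.items() if len(v) > 1
        },
    }


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "ca "},
    {"id": 2, "state": "NY"},
    {"state": "NY"},
]
report = audit_records(records)
print(report)
```

A check like this only surfaces candidates for cleaning. Deciding which variant is correct, or which duplicate to keep, still depends on knowing where the data came from.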