Procedure - Data Wrangling with MongoDB


Showing Revision 2 created 05/24/2016 by Udacity Robot.

So, let's talk a little bit about our procedure here. The first thing we want to do is build a list of all carrier values. We could do that by hand; it might actually be a little easier that way, just by looking at the HTML. We then need to build a list of airport values. Now, there are a lot of values here, so what we probably want to do is write a little script that will pull those out. Okay. All pages are going to have exactly the same list for both of these, so we can just use the browser to download an example page and pull those values out. Next, what we need to do is make HTTP requests to download all the data. I'll talk about why we want to download it all in just a minute. Then what we want to do is parse the data files. The reason why we want to do it this way is that, in building our parser, we want to make sure we're working with data that isn't going to change.
And after the fact, once we do a little bit of data cleaning, we may discover that the reason why we've got some dirty data is actually because we have a bug in our parser. It's much easier to figure out where that bug is if we've still got the original data we were parsing. I should also point out that it really doesn't make sense to download the data over and over again as we're figuring out how to parse it. Something else you might want to keep in mind is that for years prior to the current year, the data isn't going to change, so there's no reason to retrieve it more than once. So this is actually a bit of a best practice. When you've got a situation like the one we have here, a scraping task is often going to look something like this: you really want to grab all the data you need first, and then do your scraping in a separate process. So what we have for this particular problem is essentially three different steps. We first have to build all the values we're going to use to make HTTP requests. We then need to make all the HTTP requests and download the data we need. And then finally, we're going to parse the data we want out of those data files, shaping it into the particular pieces of data, the particular items, that we want to use.