Iterative SAX XML Parsing - Data Wrangling with MongoDB

Showing Revision 2 created 05/25/2016 by Udacity Robot.

  1. Okay. Let's do an exercise. Now, your task in this exercise is to look at the
  2. Chicago OSM data set and find all the top-level tags in this data set. Now,
  3. what we mean by top-level tags is essentially
  4. all the distinct types of tags that
  5. you are going to see in this data set. Okay, so osm, bounds, node, tag and
  6. so on. What I'd like you to do is
  7. loop through this data set and create a dictionary such
  8. that each time you see a particular tag, if
  9. that tag isn't already in your dictionary, you add it.
  10. At the end, your dictionary should be populated by
  11. all of the different types of tags contained in this
  12. data set. Now the challenge here is that this file
  13. is huge. If we take a look at its size,
  14. I did that just a little bit ago. We can
  15. see that it's just under 2 gigabytes of data. Now
  16. we've talked about two different types of XML parsing in
  17. this course. One is tree-based parsing where we essentially read
  18. the entire document into memory, and then work with it
  19. as nodes on a tree. The other way we've talked
  20. about parsing XML is using a SAX parser or doing
  21. iterative parsing. We actually looked at the iterparse method for
  22. ElementTree back in lesson three. And that's what you
  23. are going to be doing here. So instead of reading this
  24. entire file into memory, what we are going to do
  25. with iterparse is parse it one tag at a time
  26. and so essentially what you are doing here is
  27. treating each time you see a tag as an event
  28. and for each one of those events, what we are
  29. going to be doing is checking our dictionary to see
  30. whether we have seen a given tag before. And I don't mean this specific tag.
  31. Rather what I mean is a tag with this name. So bounds, node, tag, et cetera.
  32. If you haven't seen it before, create a new key in the dictionary. And by the
  33. time you're done parsing the file you'll have
  34. all of the unique tag names. Good luck!
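
The exercise described above can be sketched roughly as follows. This is a minimal illustration, not the course's reference solution: the function name `count_tags` and the tiny inline sample document are hypothetical, and storing a count as each dictionary value is an assumption (the transcript only asks that each tag name become a key). It uses `iterparse` from Python's standard `xml.etree.ElementTree` module and clears elements as they finish so that a very large file is not held in memory all at once.

```python
import io
import xml.etree.ElementTree as ET


def count_tags(source):
    """Map each distinct tag name in an XML source to how often it occurs.

    `source` may be a filename or a binary file-like object.
    """
    tags = {}
    root = None
    # iterparse yields (event, element) pairs while the file is being read,
    # so the whole document never has to be loaded up front.
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first "start" event is the document root
            # Add the tag name as a new key if unseen, else bump its count.
            tags[elem.tag] = tags.get(elem.tag, 0) + 1
        else:
            # The element is complete: discard its contents, and detach
            # finished children from the root so memory use stays bounded.
            elem.clear()
            root.clear()
    return tags


# Hypothetical miniature stand-in for the real OSM file.
SAMPLE = b"<osm><bounds/><node><tag/><tag/></node><node/><way><nd/><tag/></way></osm>"

print(count_tags(io.BytesIO(SAMPLE)))
```

For the real data set you would pass the path to the OSM file instead of the in-memory sample. The keys of the returned dictionary are the unique tag names the exercise asks for.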