English subtitles

Using CSV Module - Data Wrangling with MongoDB

Showing Revision 5 created 05/24/2016 by Udacity Robot.

  1. Now let's step back a minute and think a little bit more
  2. about CSV. We've talked about the fact that fields are delimited by
  3. commas. So what happens if we have a field that actually has
  4. a comma in it, like for example, this one. This particular Beatles
  5. album was released on two different labels, one in New Zealand and
  6. one in the US, so the way this data set has been
  7. set up, those two different labels are simply separated by a comma
  8. here. Now based on what we know so far about CSV, or
  9. I should say what we've discussed so far in the
  10. class about CSV, this would cause a problem for us
  11. because our parser would interpret this as a field separator.
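To make that problem concrete, here is a minimal sketch of what a naive comma split does to such a row. The header, the row, and the label names ("Label A", "Label B") are hypothetical stand-ins, not the actual values from the data set shown in the video.

```python
# Hypothetical header and row; the last field contains a comma.
header = "Title,Released,Label"
row = 'Some Album,1964,"Label A, Label B"'

expected_columns = len(header.split(","))

# Splitting on every comma ignores the quotes around the last field.
naive_fields = row.split(",")

print(expected_columns)   # number of columns the header promises
print(len(naive_fields))  # number of pieces the naive split produced
print(naive_fields)       # the label field is cut in two
```

The header promises three columns, but the split produces four pieces, and the quote characters stay attached to the broken halves.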
  12. Now, the way that the CSV format actually handles this, or
  13. the way that most applications that deal with CSV format
  14. actually handle this, is to do something like the following. So
  15. you can see here that this is the field in
  16. this actual CSV file. Over here what I've done is simply
  17. load it inside a Google sheet, but here's a raw
  18. CSV file. And you can see that the way this
  19. has been structured is for this particular line, this field
  20. has been enclosed in quotes. Okay? So what that does is
  21. indicate that you can ignore field delimiters from here to
  22. here. So, we've got some choices in terms of what
  23. we use for quotes, you could use double quotes or
  24. you can use single quotes. Well, that would cause problems
  25. in other ways. You can see here that we have this
  26. quote character here, single quote here. There's also one here, which
  27. is actually used as an apostrophe for Sgt Pepper's Lonely Hearts
  28. Club Band. So it would be extremely tedious if in our
  29. Python programs we had to deal with all of these different
  30. variations of exceptions. And the fact is that though we call
  31. it CSV, or comma separated values, you can really use any
  32. delimiter you want here, as long as that character is only used
  33. as a field delimiter in rows in our dataset.
  34. So, as is so often the case in software
  35. development, this problem has been abstracted away and solved
  36. for all of the different variations, the tedious details that
  37. we might have to deal with in order to work with a format like CSV that has so
  38. many variations, and asterisks as my friend, Will Cross
  39. is fond of saying. This is the Python CSV module.
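As a rough illustration of what the module takes care of, the sketch below parses a quoted field with csv.reader, and then the same data using a tab as the delimiter. It assumes Python 3; the sample data is hypothetical, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical data; the last field is quoted because it contains a comma.
raw = 'Title,Released,Label\nSome Album,1964,"Label A, Label B"\n'

# csv.reader honors the quotes, so the comma inside them is kept as data.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])

# The same data with a tab as the delimiter; no quoting is needed here
# because the comma is no longer the field separator.
tab_raw = "Title\tReleased\tLabel\nSome Album\t1964\tLabel A, Label B\n"
tab_rows = list(csv.reader(io.StringIO(tab_raw), delimiter="\t"))
print(tab_rows[1])
```

In both cases the row comes back as three fields, with the two labels kept together in the last one.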
  40. This module deals with CSV formats in a pretty complete
  41. way. So, let's look at how we use this module. Now,
  42. what I'm actually going to do here is use the DictReader
  43. class from this module. This assumes that what we want to do
  44. is read all of our data into dictionaries, which is
  45. what we've been doing all along, and what we'll kind of
  46. continue to do throughout the rest of the course. But
  47. it has some other pretty cool features as well. For example,
  48. it assumes that the first row of whatever file we are
  49. going to read is actually a header row. And that those
  50. are the names we want to use for fields. So, going
  51. back to our CSV file, if I scroll up to the
  52. top, we can see that this first row here is actually
  53. all of the field labels that we would like for the
  54. columns in this data set or the fields in this data
  55. set. So, what this dictionary reader will do for us is as
  56. it reads in rows, it will create a dictionary for
  57. each row. The field names will be whatever it found in
  58. that first row, and it remembers them as we read
  59. through the data file. And the values then will in turn
  60. be each of the associated values on each line of
  61. the file. And again, it also handles things like dealing with
  62. quote characters, dealing with quoted fields that may have commas inside
  63. of them, and so on. We don't have to worry about
  64. that at all, using the CSV module. So let's take a
  65. look at the rest of this code. Essentially we're just opening
  66. up the data file. We're instantiating a DictReader from the CSV
  67. module, and then we're simply looping through. Each time through here,
  68. this class is going to produce a line for us. And
  69. that line will actually be a dictionary, composed of the appropriate
  70. fields for that particular line. So then if we scroll down,
  71. what I'm going to do here is simply print out all of those
  72. values, okay? So, let's take a look at running
  73. this piece of code. And again, remember we're using
  74. the CSV module. Okay, so if we run this,
  75. the output we get, we'll just look at the second
  76. to last one here, is a dictionary composed of
  77. each of the labels that came out from the first
  78. line of this file. And the field value for
  79. each one of the fields for this particular row from
  80. the data file. Okay? And it seamlessly handles for
  81. us fields that may be quoted on a particular
  82. line, and other nuances that we might see in
  83. the CSV format, and conveniently stuffs everything into individual dictionaries
  84. for us. So, whenever you're working with CSV files
  85. in Python, it's best to use the CSV module,
  86. because so many of the challenges of working with
  87. this type of data have already been solved for us.
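The code shown on screen is not part of this transcript, so what follows is only a sketch of the pattern described above, assuming Python 3. The function name parse_csv, the column names, and the label values are hypothetical; io.StringIO stands in for the opened data file.

```python
import csv
import io

# Hypothetical stand-in for the data file: a header row, a row with a
# quoted field containing a comma, and a row with an apostrophe.
raw = (
    "Title,Label\n"
    'Some Album,"Label A, Label B"\n'
    "Sgt. Pepper's Lonely Hearts Club Band,Label C\n"
)


def parse_csv(f):
    """Read an open CSV file object into a list of dictionaries."""
    # DictReader takes the field names from the first row of the file.
    reader = csv.DictReader(f)
    return [dict(row) for row in reader]


# With a real file this would be: open(path, newline="")
records = parse_csv(io.StringIO(raw))
for record in records:
    print(record)
```

Each record is a dictionary keyed by the header row, and the quoted label field and the apostrophe both come through without any special handling.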