English subtitles

Downloading Enron Data - Intro to Machine Learning

Showing Revision 3 created 05/25/2016 by Udacity Robot.

  1. Now that I've defined a person of interest it's time to get our
  2. hands dirty with the dataset.
  3. So here's the path that I followed to find the dataset.
  4. I started on Google as I always do.
  5. I Googled for Enron emails.
  6. The first thing that pops up is the Enron Email Dataset.
  7. You can see that it's a dataset that's very famous.
  8. Many have gone before us in studying this for many different purposes.
  9. It has its own Wikipedia page.
  10. It has a really interesting article that I recommend you
  11. read from the MIT Technology Review about the many uses of
  12. this dataset throughout the years.
  13. But the first link is the dataset itself.
  14. Let's follow that link.
  15. This takes us to a page from the Carnegie Mellon CS department.
  16. It gives a little bit of background on the dataset and
  17. if we scroll down a little ways,
  18. we see this link right here.
  19. This is the actual link to the dataset.
  20. Below that is a little bit more information.
  21. If you click on this, it's going to download a TGZ file,
  22. which you can see I've already downloaded down here.
  23. If you do this on your own, be patient:
  24. it took me nearly half an hour to download the whole dataset.
  25. So I recommend that you start the download and
  26. then walk away to do something else.
  27. Once you have the dataset, you'll need to unzip it.
  28. So move it into the directory where you want to work and
  29. then you can run a command like this.
  30. There's no real magic here; I just googled how to unzip .tgz and
  31. found a command like this.
  32. Again this will take a few minutes.
  33. When you're done with that, you'll get a directory called enron mail.
  34. Then cd into maildir.
  35. Here's the dataset.
  36. It's organized into a number of directories, each of which belongs to a person.
  37. You see that there are so many here I can't even fit them all on one page.
  38. In fact, you'll find that there are over 150 people in this dataset.
  39. Each one is identified by their last name and
  40. the first letter of their first name.
  41. So, looking through on a very superficial level, I see Jeff Skilling.
  42. Let's see if I can find Ken Lay.
  43. Looks like he might be up here.
  44. Yep, there's Ken Lay.
  45. Of course, there's also a whole bunch of people I've never heard of.
  46. And remember, my question is,
  47. how many of the persons of interest do I have emails from?
  48. Do I have enough persons of interest, do I have their emails,
  49. that I could start describing the patterns in those emails,
  50. using supervised classification algorithms?
  51. And so the way that I answered this question was,
  52. again, using some work by hand basically.
  53. I took my list of persons of interest and for
  54. each person on that list I just looked for their name in this directory.
  55. Let's go back to that, remind ourselves what it looked like.
  56. You can see the annotated list here.
  57. You might have been wondering what these letters were before each of the names.
  58. These are my notes that I wrote to myself
  59. as to whether I actually have the inbox of each of these people.
  60. So Ken Lay and Jeff Skilling we already found.
  61. But then it started to become a little more difficult.
  62. So you can see there are many, many people that have n's next to their name.
  63. And that means no, I don't have, for example, Scott Yeager.
  64. If I go over to the dataset, I don't see a Yeager down here.
  65. So Scott Yeager is a person whose inbox I'd love to have.
  66. I'd love to have some emails to and from him, but I don't.
  67. As it turns out, I don't have the email inboxes of a lot of people.
  68. So I'll be honest,
  69. at this point I was actually really just discouraged about the possibility of
  70. using this as a project at all.
  71. I think I counted something like four or five people whose inboxes I had.
  72. And while that might be a few hundred emails or something like that,
  73. there's really no chance that with four examples of persons of interest I
  74. could start to describe the patterns of persons of interest as a whole.
  75. In the next video, though, I want to give you a key insight that I
  76. had that gave this project a second chance:
  77. a different way of trying to access the email inboxes of
  78. the persons of interest.
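
The transcript does not show the unzip command or how the by-hand name check was done, so the following is only a rough Python sketch of those steps, not the instructor's actual method. It assumes the mailbox directories are named as last name, a hyphen, then first initial (for example `lay-k`); the hyphen separator, the function names, and the sample list of three people are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a gzip-compressed tar (.tgz) archive into dest_dir."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Newer Python versions support extraction filters; use the safe
        # "data" filter when it is available, otherwise fall back.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)


def list_mailboxes(maildir):
    """Return the sorted names of the per-person directories under maildir."""
    return sorted(p.name for p in Path(maildir).iterdir() if p.is_dir())


def mailbox_name(first_name, last_name):
    """Build a directory name from last name plus first initial.

    Assumption: the separator is a hyphen, e.g. ("Ken", "Lay") -> "lay-k".
    """
    return f"{last_name.lower()}-{first_name[0].lower()}"


def check_pois(pois, maildir):
    """Map each (first, last) person to True if a matching mailbox exists."""
    available = set(list_mailboxes(maildir))
    return {
        f"{first} {last}": mailbox_name(first, last) in available
        for first, last in pois
    }


if __name__ == "__main__":
    # Assumption: the script is run from the directory that contains maildir.
    maildir = Path("maildir")
    if maildir.is_dir():
        print(len(list_mailboxes(maildir)), "mailboxes found")
        # Illustrative subset only; the full list of persons of interest
        # is not reproduced in the transcript.
        sample = [("Ken", "Lay"), ("Jeff", "Skilling"), ("Scott", "Yeager")]
        for person, found in check_pois(sample, maildir).items():
            print("y" if found else "n", person)
```

The y/n output mirrors the handwritten annotations described above. A simple initial-plus-surname match can miss people whose directory uses a different first name or spelling, so a manual look through the listing is still worthwhile.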