-
Title:
Downloading Enron Data - Intro to Machine Learning
-
Description:
-
Now that I've defined a person of interest it's time to get our
-
hands dirty with the dataset.
-
So here's the path that I followed to find the dataset.
-
I started on Google as I always do.
-
I Google for Enron emails.
-
The first thing that pops up is the Enron Email Dataset.
-
You can see that it's a very famous dataset.
-
Many have gone before us in studying this for many different purposes.
-
It has its own Wikipedia page.
-
It has a really interesting article that I recommend you
-
read from the MIT Technology Review about the many uses of
-
this dataset throughout the years.
-
But the first link is the dataset itself.
-
Let's follow that link.
-
This takes us to a page from the Carnegie Mellon CS department.
-
It gives a little bit of background on the dataset and
-
if we scroll down a little ways,
-
we see this link right here.
-
This is the actual link to the dataset.
-
Below that is a little bit more information.
-
If you click on this it's going to download a TGZ file,
-
which you can see I've already downloaded down here.
-
If you do this on your own, be aware that
-
it took me nearly half an hour to download the whole dataset.
-
So I recommend that you start the download and
-
then walk away to do something else.
-
Once you have the dataset you'll need to unzip it.
-
So move it into the directory where you want to work and
-
then you can run a command like this.
-
There's no real magic here; I just Googled how to unzip .tgz and
-
found a command like this.
-
Again this will take a few minutes.
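-
As a rough sketch of that unzip step, here is one way to do it from Python instead of the shell. This uses the standard-library tarfile module, which is a swapped-in alternative and not the command shown in the video. The function name extract_archive and the archive name in the usage comment are hypothetical placeholders.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive (.tgz) into dest_dir."""
    dest = Path(dest_dir)
    # Create the working directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the file for reading with gzip decompression.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return dest


# Hypothetical usage; replace the placeholder with the name of the
# file you actually downloaded:
# extract_archive("enron_mail.tgz", ".")
```

Only extract archives that come from a source you trust, since a tar file can contain paths that write outside the destination directory.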
-
When you're done with that you'll get a directory called enron mail.
-
And then cd into maildir.
-
Here's the dataset.
-
It's organized into a number of directories, each of which belongs to a person.
-
You see that there are so many here I can't even fit them all on one page.
-
In fact, you'll find that there are over 150 people in this dataset.
-
Each one is identified by their last name and
-
the first letter of their first name.
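-
If you'd rather count the people than scroll through them, a minimal sketch like this lists the per-person directories. It assumes you run it from the directory that contains maildir; the function name list_people is made up for illustration.

```python
from pathlib import Path


def list_people(maildir="maildir"):
    """Return the sorted names of the per-person directories in maildir."""
    return sorted(
        entry.name for entry in Path(maildir).iterdir() if entry.is_dir()
    )


# Usage (assumes maildir is in the current working directory):
# people = list_people("maildir")
# print(len(people))
```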
-
So, looking through on a very superficial level, I see Jeff Skilling.
-
Let's see if I can find Ken Lay.
-
Looks like he might be up here.
-
Yep, there's Ken Lay.
-
Of course, a whole bunch of people I've never heard of.
-
And remember, my question is,
-
how many of the persons of interest do I have emails from?
-
Do I have enough persons of interest, and enough of their emails,
-
that I could start describing the patterns in those emails,
-
using supervised classification algorithms?
-
And so the way that I answered this question was,
-
again, using some work by hand basically.
-
I took my list of persons of interest and for
-
each person on that list I just looked for their name in this directory.
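-
I did this lookup by hand, but the same check could be sketched in code. This assumes the directories are named as the last name, a hyphen, and the first initial, all lowercase (for example lay-k); the hyphen is my assumption about the layout, and the function names are hypothetical.

```python
from pathlib import Path


def directory_name(first_name, last_name):
    """Build the expected directory name: last name, hyphen, first initial.

    The hyphenated lowercase form is an assumption about the layout.
    """
    return f"{last_name}-{first_name[0]}".lower()


def check_inboxes(persons, maildir="maildir"):
    """Map each (first, last) pair to True if a matching directory exists."""
    root = Path(maildir)
    return {
        (first, last): (root / directory_name(first, last)).is_dir()
        for first, last in persons
    }


# Usage with three names mentioned in this video:
# check_inboxes([("Ken", "Lay"), ("Jeff", "Skilling"), ("Scott", "Yeager")])
```

A match here is only a hint, since two different people could share a last name and first initial, so it is still worth confirming each hit by eye.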
-
Let's go back to that, remind ourselves what it looked like.
-
You can see the annotated list here.
-
You might have been wondering what these letters were before each of the names.
-
These are notes that I wrote to myself
-
as to whether I actually have the inbox of each of these people.
-
So Ken Lay and Jeff Skilling we already found.
-
But then it started to become a little more difficult.
-
So you can see there are many, many people that have n's next to their name.
-
And that means no, I don't have their inbox. Take, for example, Scott Yeager.
-
If I go over to the dataset, I don't see a Yeager down here.
-
So Scott Yeager is a person whose inbox I'd love to have.
-
I'd love to have some emails to and from him, but I don't.
-
As it turns out, I don't have the email inboxes of a lot of people.
-
So I'll be honest,
-
at this point I was actually really just discouraged about the possibility of
-
using this as a project at all.
-
I think I counted something like four or five people whose inboxes I had.
-
And while that might be a few hundred emails or something like that,
-
there's really no chance that with four or five examples of persons of interest I
-
could start to describe the patterns of persons of interest as a whole.
-
In the next video, though, I want to give you a key insight that I
-
had that gave this project a second chance:
-
a different way of trying to access the email inboxes of
-
the persons of interest.