RailsConf 2014 - Demystifying Data Science: A Live Tutorial by Todd Schneider

0:17 - 0:18

TODD SCHNEIDER: All right. We're, we're good.
Thank you.
0:18 - 0:20

Sorry for the delay. Classic.
0:20 - 0:22

Even in the future nothing works. Welcome.
0:22 - 0:26

I am Todd. I'm an engineer at Rap Genius.
0:26 - 0:32

And today's talk is going to be about data
science with a live tutorial.
0:32 - 0:34

And before we get into the live coding component,
0:34 - 0:36

I wanted to show you all a project I
0:36 - 0:39

built previously, which kind of serves as
the inspiration
0:39 - 0:41

for this talk. Sort of. So this is a
0:41 - 0:45

website called weddingcrunchers dot com. What
is Wedding Crunchers?
0:45 - 0:48

It's a place where you can track the, the
0:48 - 0:51

popularity of words and phrases in the New
York
0:51 - 0:54

Times wedding section over the past thirty-some
years.
0:54 - 0:56

And a lot of you might be wondering why
0:56 - 0:59

on earth would this be interesting or relevant
or
0:59 - 1:02

funny or anything, and I hope to convince
you
1:02 - 1:04

of that very quickly. Here is a, a example
1:04 - 1:07

wedding announcement from the New York Times.
This one's
1:07 - 1:08

from 1985.
1:08 - 1:09

If you don't know, if you don't live in
1:09 - 1:11

New York, read the New York Times, the wedding
1:11 - 1:14

section has a certain cultural cachet. It's
kind of
1:14 - 1:16

an honor to be listed in there and it's
1:16 - 1:19

got a very resume-like structure. People get
to brag
1:19 - 1:20

about where they went to school and what they
1:20 - 1:21

do.
1:21 - 1:23

So here is an example. You know, Diane deCordova
1:23 - 1:25

is marrying Michael Monro Lewis. They both
went to
1:25 - 1:28

Princeton. They graduated Cum Laude. You know,
she works
1:28 - 1:30

at Morgan Stanley. He works at Salomon Brothers
in
1:30 - 1:33

New York and they're gonna go to London. And
1:33 - 1:34

this should be a little familiar to a bunch
1:34 - 1:35

of you.
1:35 - 1:38

Mr. Lewis, an associate at Salomon Brothers,
is Michael Lewis.
1:38 - 1:41

He's the guy who wrote Liar's Poker, a famous
book about
1:41 - 1:43

his experience there. And before, before he
was a
1:43 - 1:46

famous writer, he was just another New York
Times
1:46 - 1:50

wedding announced person.
1:50 - 1:52

And so what Wedding Crunchers does is it takes
1:52 - 1:55

the entire corpus of New York Times wedding
announcements
1:55 - 1:57

back from 1981 and you can search for words
1:57 - 2:00

and phrases and you can see how common those
2:00 - 2:02

words and phrases are, you know, by year.
It's
2:02 - 2:03

like, this is a good one that's relevant to
2:03 - 2:06

people here. You know, banker and programmer.
You know,
2:06 - 2:09

for example, when you list so-and-so is a
banker
2:09 - 2:12

or is a programmer in the announcement and
you
2:12 - 2:14

see, over time, you know, banker used to be
2:14 - 2:18

way more commonly used than programmer in
these announcements.
2:18 - 2:21

And only just this year, in 2014, programmer
has
2:21 - 2:28

finally overtaken banker as, you know, the,
the place,
2:28 - 2:30

you know, the people getting married in New
York,
2:30 - 2:33

who are part of society, come from. Another
good
2:33 - 2:35

one is, if you look at Goldman Sachs and
2:35 - 2:38

Google- is my internet on? Good.
2:38 - 2:41

So here's another good one. So Goldman Sachs,
you
2:41 - 2:44

know, classic New York financial institution.
Google, new kid
2:44 - 2:47

on the block. Tech scene. Boom. Taking over.
2:47 - 2:50

And, you know, this is obviously fun, and
it's
2:50 - 2:52

amusing. But it's also actually pretty insightful
for a
2:52 - 2:56

relatively simple concept. I mean, this one
graph tells
2:56 - 2:59

a pretty powerful story of, you know, New
York
2:59 - 3:02

the, the finance capital of the world. Meanwhile,
we
3:02 - 3:04

have this sort of emerging tech scene. You
know,
3:04 - 3:05

Google may be the biggest player in the kind
3:05 - 3:07

of new tech world.
3:07 - 3:10

And now, when you turn to the society pages
3:10 - 3:11

to see who's getting married, you know, there's
more
3:11 - 3:14

employees from Google than there are from
Goldman Sachs.
3:14 - 3:17

And that's, you know, kind of an interesting thing
in
3:17 - 3:18

the world.
3:18 - 3:20

And so what we're gonna do today is build
3:20 - 3:25

something just like Wedding Crunchers, except,
instead of using
3:25 - 3:28

the text of wedding announcements to analyze,
we're going
3:28 - 3:33

to look at all of the RailsConf talk abstracts.
3:33 - 3:34

And so, you know, hopefully this is, this
is
3:34 - 3:37

interesting to people here and, I always say,
you
3:37 - 3:39

know, if there's only one thing you take from
3:39 - 3:41

this talk, really, what it should be is that,
3:41 - 3:44

you know, work on a problem that's interesting
to
3:44 - 3:46

you. Because, especially when you're dealing
with data science,
3:46 - 3:48

a lot of it's pretty messy and then you
3:48 - 3:49

have to go through scraping stuff as we'll
get
3:49 - 3:52

into, and it's easy to get frustrated and
kind
3:52 - 3:54

of lost and like, if you're not working on
3:54 - 3:55

something that you care about, and something
that you
3:55 - 3:58

really want to know, kind of, the final result,
3:58 - 4:00

it's just much easier to get distracted and
kind
4:00 - 4:01

of, ultimately, bail.
4:01 - 4:04

So, again, if you take one thing, just work
4:04 - 4:08

on something that is interesting to you. So
the
4:08 - 4:10

particular kind of analysis we're gonna do
is something
4:10 - 4:13

called n-gram analysis. And I have a little
example
4:13 - 4:14

set up here. So what is an n-gram? You
4:14 - 4:16

may have heard the word before.
4:16 - 4:19

Really, all it means is, you know, a, a
4:19 - 4:24

consecutive words as part of a sentence. So
like,
4:24 - 4:26

here's a very simple example, one sentence. This talk is
talk is
4:26 - 4:28

boring. What are the, what are the one grams
4:28 - 4:30

in this sentence? It's just the words. This,
talk,
4:30 - 4:33

is, and boring. The two grams are every pair
4:33 - 4:36

of consecutive words. This talk, talk is,
is boring,
4:36 - 4:37

and so on.
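
To make the definition concrete, here is a minimal Ruby sketch that lists the n-grams of the example sentence by sliding a window of n words across it. The helper name `ngrams_of` is made up for illustration and is not the code written later in the talk.

```ruby
# Hypothetical helper: return every run of n consecutive words in a sentence.
def ngrams_of(sentence, n)
  words = sentence.split
  # Slide a window of width n across the word list. The range is empty when
  # the sentence has fewer than n words, so the result is then an empty array.
  (0..words.length - n).map { |start| words[start, n].join(" ") }
end

p ngrams_of("This talk is boring", 1) # => ["This", "talk", "is", "boring"]
p ngrams_of("This talk is boring", 2) # => ["This talk", "talk is", "is boring"]
```

A sentence of four words has four 1-grams, three 2-grams, two 3-grams and one 4-gram.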
4:37 - 4:38

And so what we need to be able to
4:38 - 4:41

do in order to build, you know, a graph
4:41 - 4:43

like this, is we need to take a term
4:43 - 4:45

that's, you know, relevant to RailsConf, say
something like
4:45 - 4:47

Ember or whatever, and we need to be able
4:47 - 4:49

to look up, you know, for each year how
4:49 - 4:51

many times does this, you know, word or n-gram
4:51 - 4:54

appear in the data.
4:54 - 4:56

And so that is what we are going to
4:56 - 4:59

build. And I have this brief little outline
here.
4:59 - 5:01

There's kind of three steps. And this is pretty
5:01 - 5:05

general to, to any data project. You know,
step
5:05 - 5:07

one is gonna be just gathering the data, getting
5:07 - 5:10

it in some usable form. Step two is gonna
5:10 - 5:11

be kind of the analysis part where we do
5:11 - 5:14

the n-gram calculation. We store the results.
And then
5:14 - 5:16

step three is gonna be to create a nice
5:16 - 5:19

little front-end interface that lets us investigate,
visualize and
5:19 - 5:21

see what we've done.
5:21 - 5:23

Now unfortunately, you know, in a, in a thirty
5:23 - 5:26

minute talk we can't possibly do all of this.
5:26 - 5:29

So we're gonna focus more on items one and
5:29 - 5:31

two and less so on three, and even then
5:31 - 5:33

it's too much. So, you know, I sort of
5:33 - 5:35

used the analogy, it'll be a bit like watching
5:35 - 5:37

TV on the Food Network, where we might, you
5:37 - 5:40

know, throw something in the oven, mysteriously
something else
5:40 - 5:42

pops out of the other oven even though it's,
5:42 - 5:44

where did that come from?
5:44 - 5:46

But not to worry. Everything is also on GitHub.
5:46 - 5:48

There's a repo I'll share with you at the
5:48 - 5:50

end. So anything that we don't cover or that
5:50 - 5:52

we cover too quickly or something, you'll
be able
5:52 - 5:54

to see sort of the, the full version on
5:54 - 5:56

GitHub.
5:56 - 5:58

So let us jump in now to step one,
5:58 - 6:00

which is, you know, gathering the data. And
so
6:00 - 6:02

let's take a look back at the, the RailsConf
6:02 - 6:03

website again. So we have to figure out how
6:03 - 6:06

we're gonna model a, a RailsConf talk in our
6:06 - 6:10

database. So like, what, you know, attributes
does a,
6:10 - 6:13

do a, excuse me, does a RailsConf talk have.
6:13 - 6:14

And it's like, one thing we see is they
6:14 - 6:18

all have titles. So that looks like something.
They
6:18 - 6:20

have speakers. You know, there's this thing,
which is
6:20 - 6:23

the abstract, and then there's the bio. And
that's
6:23 - 6:25

probably it. That's probably all we need.
6:25 - 6:28

So that's pretty simple. And, you know, I
have
6:28 - 6:30

the little migration. I've already run here.
But here
6:30 - 6:32

are attributes for talks. It's just the year,
you
6:32 - 6:34

know, what, what conference were we actually
at. The
6:34 - 6:36

title of the talk, the speaker, the abstract,
and
6:36 - 6:38

the bio.
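
As a rough picture of that record, the sketch below mirrors the five attributes with a plain Ruby Struct. This is only a stand-in: the app in the talk uses a Rails model backed by a migration, and the placeholder values here are invented.

```ruby
# Stand-in for the talks table described above. The real app uses a Rails
# model; this Struct only mirrors the attribute names from the talk.
Talk = Struct.new(:year, :title, :speaker, :abstract, :bio, keyword_init: true)

talk = Talk.new(
  year: 2014,
  title: "Demystifying Data Science: A Live Tutorial",
  speaker: "Todd Schneider",
  abstract: "(abstract text goes here)", # placeholder, not the real abstract
  bio: "(speaker bio goes here)"         # placeholder, not the real bio
)

puts talk.year
puts talk.title
```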
6:38 - 6:41

And so also, that's, again, pretty straightforward.
The Gemfile
6:41 - 6:45

is also very simple. It's mostly pretty boilerplate.
6:45 - 6:48

Rails 4, Ruby 2.1. The only gems I wanted
6:48 - 6:49

to call out here are, we're gonna use nokogiri
6:49 - 6:52

for, you know, fetching, or, parsing websites
and kind
6:52 - 6:54

of scraping the data we need. We're gonna
use
6:54 - 6:56

Postgres as our main data store and we're gonna
6:56 - 6:58

use redis to build this sort of index that
6:58 - 7:00

we can ultimately use to look up, you know,
7:00 - 7:02

how common a word is.
7:02 - 7:05

And so one thing that's not here is, like,
7:05 - 7:09

you know, gem fancy data algorithm. And a
lot
7:09 - 7:11

of people, this is kind of where Ruby often
7:11 - 7:13

gets a bad reputation of, you know, not being
7:13 - 7:16

supportive of scientific computing or whatever.
And other languages
7:16 - 7:19

have more, more support. But my claim is that
7:19 - 7:21

it's really not that important. You can get
a
7:21 - 7:24

ton of mileage out of very simple tools that
7:24 - 7:24

you can build yourself.
7:24 - 7:26

You know, you don't need a fancy gem or
7:26 - 7:28

any fancy algorithm. Those things are cool
too and
7:28 - 7:31

they have their place. But they're not needed
a
7:31 - 7:33

lot of the time. And, you know, Ruby is
7:33 - 7:36

a wonderful language for, especially, scraping
stuff from the
7:36 - 7:38

web. There's a ton of support there. And so
7:38 - 7:41

I don't think that the, the lack of, you
7:41 - 7:44

know, fancy algorithm gems should necessarily
be a deterrent
7:44 - 7:44

at all.
7:44 - 7:47

And so hopefully part of this talk is convincing
7:47 - 7:50

people that Ruby and Rails are actually quite
well-suited
7:50 - 7:51

to problems like this.
7:51 - 7:54

OK. So now we actually need to write some
7:54 - 7:56

code to scrape the talk. And you know, if
7:56 - 7:57

you've ever done anything like this before,
you know
7:57 - 8:00

that Chrome Inspector is your best friend.
So let's
8:00 - 8:02

fire that up. We're gonna inspect element,
and so
8:02 - 8:04

like, we actually, what we need to do now
8:04 - 8:07

is take you know, this HTML on the page
8:07 - 8:09

and turn it into a database record that we
8:09 - 8:12

can then, you know, use to our advantage later.
8:12 - 8:13

And so it looks like, you know, all the
8:13 - 8:17

talks are in these session classes. So that's
something.
8:17 - 8:20

We can look in here. This looks like something.
8:20 - 8:23

So let's make this bigger.
8:23 - 8:25

And you know it helps to, well, it's kind
8:25 - 8:29

of essential to be decent with CSS selectors
here,
8:29 - 8:32

because that's how we're going to basically
find stuff.
8:32 - 8:35

So let's see, OK, so there's eighty-one session
divs.
8:35 - 8:38

That sounds about right. I happen to know
that
8:38 - 8:42

mine is number seventy-eight, so let's, let's
look at
8:42 - 8:44

that. And so here we are. So we need
8:44 - 8:47

to, again, the, the things we're mod- or,
the
8:47 - 8:50

attributes we're storing are the title, the
speaker, the
8:50 - 8:53

abstract, and the bio. And so we're gonna
need
8:53 - 8:55

to pull these things out.
8:55 - 8:58

So let's see. It looks like the, the title
8:58 - 9:00

is in this h1 element inside the header. So
9:00 - 9:05

let's just make sure that works. You know,
header
9:05 - 9:08

h1. That looks right.
9:08 - 9:14

The, the speaker looks to be the header h2.
9:14 - 9:16

Cool.
9:16 - 9:21

Now the abstract is in this p tag, so
9:21 - 9:23

we can do something like this. But this is
9:23 - 9:26

actually not quite right. So what's wrong
with this?
9:26 - 9:30

Well, the abstract ends at, you know, "suited
to the
9:30 - 9:32

problem." The bio here is also in a p
9:32 - 9:35

tag: "Originally a math guy." And we've actually
pulled
9:35 - 9:37

all the p-tags. So we need a way of
9:37 - 9:39

not doing that. And this is where you just
9:39 - 9:40

need to know a little bit of CSS. Not
9:40 - 9:43

very complicated. But if you use the little
greater
9:43 - 9:45

than guy, what this says is only take the
9:45 - 9:47

p tags that are immediate children of the
session
9:47 - 9:50

div. And so now we have, you know, only
9:50 - 9:51

the abstract.
9:51 - 9:54

And lastly, you know, the bio is just in
9:54 - 9:58

its own little section. So something like
that. Cool.
9:58 - 10:00

So that is the jQuery version of it. We
10:00 - 10:03

need to do this, though, in Ruby. And as
10:03 - 10:05

I said, this does sometimes get a little tedious.
10:05 - 10:07

But let's, let's write the code. So I have
10:07 - 10:12

this empty method - create_railsconf_2014_talks.
And also this method
10:12 - 10:15

I've written already called fetch_and_parse,
which just gets a
10:15 - 10:17

URL and sends it to nokogiri, which we can
10:17 - 10:18

then use to do our CSS selectors.
10:18 - 10:21

So let, let's just write this. So we can
10:21 - 10:27

say doc is fetch_and_parse. The url is this.
Let's
10:27 - 10:34

see if this works in the console.
10:34 - 10:41

Of course, in here. Do I have internet? Nice.
10:47 - 10:53

So we can then check the same thing. Again.
10:53 - 10:58

Looks right. Let's find my talk, which, this
part
10:58 - 10:59

I couldn't possibly tell you. When you use
the
10:59 - 11:02

nokogiri, the eq thing, you have to add two
11:02 - 11:04

from whatever jQuery does. So I'm number 80
now.
11:04 - 11:07

Don't ask me why. I couldn't possibly tell
you.
11:07 - 11:10

But maybe someone here knows. Be curious to
find
11:10 - 11:11

out.
11:11 - 11:12

AUDIENCE: ?? (00:11:13)
11:12 - 11:15

T.S.: So there it is. There's the title. So
11:15 - 11:17

let us now write some code here. We have
11:17 - 11:22

our, our document. We're gonna go through
each session.
11:22 - 11:24

The CSS method is kind of like, you know,
11:24 - 11:29

the selector for nokogiri. Each elements.
So each of
11:29 - 11:35

these we're gonna create a talk.
11:35 - 11:38

And again. So the year we already know is
11:38 - 11:45

2014. The title we're gonna say is,
elm.css("header h1").inner_text.
11:48 - 11:55

Speaker, header h2, dun nuh nuh dun nuh nuh
12:00 - 12:05

nuh. Gettin' there.
12:05 - 12:10

All right. So I think this will probably work.
12:10 - 12:14

Let's find out. And so we're back in here.
12:14 - 12:19

Just to prove to you that I'm not lying,
12:19 - 12:23

2014 dot count. There's none of them. And,
what'd
12:23 - 12:26

I call this method? This guy. Delayed::Job.
12:26 - 12:33

All right. So we just did something. Did it
12:33 - 12:40

work? Nice. We got eighty-one talks. Most
importantly, let's,
12:41 - 12:42

we have my talk. That's the, that's the only
12:42 - 12:47

one that matters anyway. And so, you know,
you
12:47 - 12:48

might be thinking now, like, you know, what
the
12:48 - 12:50

heck, I came to the, the data science talk,
12:50 - 12:52

not the scraping talk. You know, to that,
I
12:52 - 12:56

would say, tough luck. They're the same thing.
You
12:56 - 12:58

know, you might not, you might not want to
12:58 - 13:00

hear it, but guess what, this is usually the
13:00 - 13:02

most important part of the entire project.
13:02 - 13:05

It's the hardest part, you know, because guess
what,
13:05 - 13:07

just because we got the 2014 talks, you know,
13:07 - 13:09

now we have to get the 2013 talks. And
13:09 - 13:11

the 2012 talks. And they're all on different
websites.
13:11 - 13:13

They all have different structures. You know,
you're gonna
13:13 - 13:15

have to write different code to get each type
13:15 - 13:17

of website. It's a pain. And this is why
13:17 - 13:19

I said earlier, you know, really make sure
you're
13:19 - 13:21

working on something you care about. Because
it's just
13:21 - 13:24

not fun to like, like, ugh, in 2008 they
13:24 - 13:27

separated the speakers and the abstracts.
And it's like,
13:27 - 13:29

it's just, it's annoying, but again, it's
the most
13:29 - 13:30

important part I would say.
13:30 - 13:33

You know, so much of data science is taking
13:33 - 13:36

data that's either unstructured or structured
in the wrong
13:36 - 13:39

format for you and, you know, getting it into
13:39 - 13:41

the way, you know, into the structure that
you
13:41 - 13:43

need to do whatever analysis you want to do.
13:43 - 13:45

So in this case, that's taking, you know,
html
13:45 - 13:48

on a page and converting it into a Postgres
13:48 - 13:49

database.
13:49 - 13:53

And so we have done that now. And again,
13:53 - 13:54

take my word that, you know, I've done this
13:54 - 13:57

for the other years as well. Back to 2007
13:57 - 14:01

and so we have a total of 497 talks
14:01 - 14:04

in here from RailsConfs over the years. And
so
14:04 - 14:07

that's cool. That's basically our dataset
that we're gonna
14:07 - 14:07

use.
14:07 - 14:09

And so we can sort of move on to,
14:09 - 14:11

you know, step two of the project here, which
14:11 - 14:14

is, you know, do the n-gram calculation and
store
14:14 - 14:17

the results. And so let's go back to talk.rb.
14:17 - 14:19

All this by the way is just in, you
14:19 - 14:22

know, app/models/talk.rb. That's where all
this code is.
14:22 - 14:26

And I have another empty method somewhere
called def
14:26 - 14:28

ngrams. And so this method, we're gonna need
to
14:28 - 14:30

give, you know, it goes on a talk. So
14:30 - 14:32

given a value of n, calculate all the ngrams
14:32 - 14:35

from that talk's abstract.
14:35 - 14:36

And so, what are we gonna do here? So
14:36 - 14:43

again, let's look at, talk dot mine. Dot abstract.
14:44 - 14:45

So here's the abstract, and we need to, you
14:45 - 14:49

know, get ngrams out of this. And so the
14:49 - 14:51

first thing, I've written a little helper
method over
14:51 - 14:54

here. Which I've just tacked onto String, called
14:54 - 14:57

normalized_for_ngrams. And you know, what
does this do? Well,
14:57 - 15:00

it downcases it, cause we're gonna do case
insensitive.
15:00 - 15:02

There might be cases where you want to keep
15:02 - 15:04

case sensitivity. Whatever. Doesn't really
matter. In this case
15:04 - 15:06

we're gonna go case insensitive.
15:06 - 15:09

Squish is a nice, convenient method that will
kind
15:09 - 15:11

of standardize the white space for you. So
like,
15:11 - 15:14

if there's any trailing or leading white space,
and
15:14 - 15:17

if there's like a bunch of middle white space,
15:17 - 15:19

this will, it'll kill the beginning and ending
and
15:19 - 15:21

it'll turn anything in the middle into a single
15:21 - 15:21

space.
15:21 - 15:22

So that way you just don't have to worry
15:22 - 15:25

about things like double spaces or, you know,
other,
15:25 - 15:27

other weird things that can happen. Cause
of course
15:27 - 15:29

it's the web. Whatever can go wrong will go
15:29 - 15:32

wrong. So make sure that your data's in
some
15:32 - 15:33

kind of standardized format.
15:33 - 15:36

And the last thing I've done is removed punctuation.
15:36 - 15:38

And the reason for that is just cause like,
15:38 - 15:40

you know, there's commas, periods, colons,
all sorts of
15:40 - 15:43

stuff like that. We don't really care about
it.
15:43 - 15:45

And so let's just kill any character that's
not
15:45 - 15:47

either a space or a word character. This is
15:47 - 15:49

kind of the, little like, Ruby special regex
thing.
15:49 - 15:53

So we're gonna kill punctuation.
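
A plain-Ruby sketch of the three normalization steps just described. The talk's helper lives on String and uses `squish`, which comes from ActiveSupport; here it is approximated with `strip` plus a whitespace-collapsing `gsub` so the example runs without Rails. The standalone method name is an assumption.

```ruby
# Stand-in for the talk's String helper. ActiveSupport's String#squish is
# approximated with strip + gsub so the sketch runs in plain Ruby.
def normalize_for_ngrams(text)
  text
    .downcase             # case-insensitive matching
    .strip                # drop leading and trailing whitespace
    .gsub(/\s+/, " ")     # collapse runs of whitespace into one space
    .gsub(/[^\w\s]/, "")  # delete anything that is not a word character or whitespace
end

puts normalize_for_ngrams("  This talk,   is BORING!  ") # this talk is boring
```

One consequence of deleting punctuation outright is that contractions lose their apostrophe, so "don't" is stored as "dont".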
15:53 - 15:54

And so we can actually just mess with this
15:54 - 15:57

in the console maybe. So let's take our little
15:57 - 16:00

example sentence. You know, this talk is boring.
And
16:00 - 16:04

let's normalize that for ngrams. OK. All it
did
16:04 - 16:08

was downcase it. And now we want to get
16:08 - 16:09

that into an array of words, which we can
16:09 - 16:13

just do with split. Cool.
16:13 - 16:17

And now there's actually this neat little
Ruby enumerable
16:17 - 16:18

thing, which I didn't know about until pretty
recently.
16:18 - 16:22

each_cons, which stands for each consecutive. And it
And it
16:22 - 16:25

takes an argument, a single number, like two,
and
16:25 - 16:27

what this says is give me all of the,
16:27 - 16:30

you know, consecutive pairs of two. So if
we
16:30 - 16:32

to_a this, now we have this array of arrays,
16:32 - 16:34

which looks like exactly what we want.
16:34 - 16:37

This talk, talk is, and is boring. And so
16:37 - 16:38

the last thing we can do there is we
16:38 - 16:44

can just map that array to make these just
16:44 - 16:44

phrases.
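
The console steps above, written out as a small sketch. `each_cons` is a standard Ruby Enumerable method; the standalone `ngrams(text, n)` function at the end is a hypothetical version of the model method, taking already normalized text as an argument.

```ruby
words = "this talk is boring".split
# => ["this", "talk", "is", "boring"]

pairs = words.each_cons(2).to_a
# => [["this", "talk"], ["talk", "is"], ["is", "boring"]]

phrases = pairs.map { |pair| pair.join(" ") }
# => ["this talk", "talk is", "is boring"]

# Generalized to any n (hypothetical standalone version of the ngrams method).
def ngrams(text, n)
  text.split.each_cons(n).map { |run| run.join(" ") }
end

p ngrams("this talk is boring", 3) # => ["this talk is", "talk is boring"]
```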
16:44 - 16:47

So cool. So this is actually the entirety
of
16:47 - 16:50

our ngrams method, is just, you know, this
code
16:50 - 16:52

right here. So let's copy and paste this into
16:52 - 16:56

the old method here. So we want. We're doing
16:56 - 17:03

this on the abstract. Let's get some new lines
17:03 - 17:04

here.
17:04 - 17:10

All right, cool. So again, just to recap,
you
17:10 - 17:12

take the abstract, we normalize it, which
means, you
17:12 - 17:15

know, downcase and kill the punctuation. We
split it
17:15 - 17:17

to words. Uh, wait. Actually this should not
be
17:17 - 17:21

two. That should be n. And then we join
17:21 - 17:24

those. So let's, let's see if this worked.
17:24 - 17:31

So talk dot mine again. And one. OK. So
17:31 - 17:33

here are all the one grams, which is just
17:33 - 17:36

the sequence of words. And that looks correct.
And
17:36 - 17:42

all of the two grams. Also looks correct,
I
17:42 - 17:45

think. Yeah. To get, get a, yeah, OK, perfect.
17:45 - 17:48

And so this is kind of the, the method
17:48 - 17:51

we're gonna use to decompose these talks into
just,
17:51 - 17:54

you know, an array of words and phrases. And
17:54 - 17:56

so what is the next step, now that we
17:56 - 17:58

have this method? Well, the next step is we
17:58 - 17:59

have to build these indexes that we're actually
gonna
17:59 - 18:04

use to look up, you know, the final results.
18:04 - 18:05

And so for that, we're gonna use redis.
18:05 - 18:07

Now, we don't have sort of enough time to
18:07 - 18:11

really get totally into the details of redis.
But,
18:11 - 18:12

you know, the, the thing that we're really
gonna
18:12 - 18:15

use is the, the sorted set data structure,
which
18:15 - 18:16

I'd definitely encourage you to check out.
It's a
18:16 - 18:19

great data structure. Great feature of redis.
And so
18:19 - 18:20

what is a sorted set?
18:20 - 18:23

Well, it's got the word set in it, so
18:23 - 18:25

that tells you something. It's, you know,
unique elements.
18:25 - 18:27

And the, the neat feature of a sorted set
18:27 - 18:29

is that each element in the set also has
18:29 - 18:32

a score associated with it. So the way we
18:32 - 18:35

can use this is, remember, again, the question
I'm
18:35 - 18:37

gonna answer is, like, you know, if someone
searches
18:37 - 18:39

for Ember, you know, how many times was Ember
18:39 - 18:40

mentioned in 2007. How many times was it mentioned
18:40 - 18:42

in 2008. How many times was it mentioned in
18:42 - 18:43

2009?
18:43 - 18:45

So we're gonna have one sorted set for each
18:45 - 18:48

year, where the members of the sorted set
are
18:48 - 18:50

all the words and phrases that appeared in
RailsConf
18:50 - 18:54

talks, and the scores are the number of times
18:54 - 18:56

that those ngrams appeared.
18:56 - 18:58

And then, you know, redis is very efficient
about
18:58 - 19:00

this zscore method. You can look up. It's
like
19:00 - 19:03

this command right here would say, OK, in
the
19:03 - 19:06

sorted set for 2014, get me the score associated
19:06 - 19:09

with the member ember. And that's gonna tell
you,
19:09 - 19:12

you know, some number. Like, three or whatever.
Is
19:12 - 19:14

the number of times it gets mentioned.
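
To illustrate the idea without a running redis server, here is a tiny in-memory stand-in for a single sorted set. The class is hypothetical; only the method names are borrowed from the Redis commands ZINCRBY and ZSCORE, and real Redis also keeps members ordered by score, which this sketch does not.

```ruby
# In-memory stand-in for one Redis sorted set: unique members, each with a score.
class TinySortedSet
  def initialize
    @scores = Hash.new(0.0)
  end

  # Like ZINCRBY: add `amount` to the member's score, creating it if needed.
  def zincrby(amount, member)
    @scores[member] += amount
  end

  # Like ZSCORE: the member's score, or nil when the member is absent.
  def zscore(member)
    @scores.key?(member) ? @scores[member] : nil
  end
end

set_2014 = TinySortedSet.new
3.times { set_2014.zincrby(1, "ember") }
p set_2014.zscore("ember")    # => 3.0
p set_2014.zscore("backbone") # => nil
```

Because a member can appear only once, incrementing an n-gram that is already present raises its score instead of adding a duplicate, which is what makes the structure a natural counter.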
19:14 - 19:16

So what we have to do is build these
19:16 - 19:19

sorted sets. One for each year again. And
again
19:19 - 19:24

I have an empty method called generate_ngram_data_by_year.
So iterate
19:24 - 19:26

through all talks from a given year, you know,
19:26 - 19:27

calculate the ngram counts and add it to the
19:27 - 19:30

appropriate redis sorted set. So let's write
that.
19:30 - 19:32

So one thing we need to do is make
19:32 - 19:34

sure we're not double counting. So if we have
19:34 - 19:37

an old sorted set sitting around, let's delete
it.
19:37 - 19:40

So let's, redis.del(year). We need to decide
what
19:40 - 19:43

values of n we're gonna use. So let's just
19:43 - 19:46

say one, two, and three, meaning we're gonna
calculate
19:46 - 19:48

all the one grams, two grams, three grams.
Anything
19:48 - 19:50

longer than that and it's sort of, like, what's
19:50 - 19:52

even the point. You're getting into pretty
specific sentences.
19:52 - 19:53

There's not gonna be a lot of repetition.
19:53 - 19:56

So now we need to iterate through each talk
19:56 - 20:03

for the given year. where(:year => year).find_each.
And then
20:06 - 20:08

for each talk we need to iterate through each
20:08 - 20:14

value of n. And then for each value of
20:14 - 20:16

n, what do we need to do? We need
20:16 - 20:17

to calculate the ngram, so do talk dot ngrams.
20:17 - 20:19

This is the method we just wrote. We're gonna
20:19 - 20:20

pass it n.
20:20 - 20:23

Do |ngram|.
20:23 - 20:26

And then finally, we're going to add this
to
20:26 - 20:29

the relevant redis sorted set. So the command
for
20:29 - 20:30

that is redis.zincrby.
20:30 - 20:35

And this goes, you give it a year, you
20:35 - 20:39

give it a number, like one, and you give
20:39 - 20:40

it what are you incrementing.
20:40 - 20:43

OK. So let's look at this method now. We're
20:43 - 20:45

gonna take, give it a year. We're gonna go
20:45 - 20:48

through every talk from that year. We're gonna
go
20:48 - 20:51

through values of n, which is one, two and
20:51 - 20:53

three, so let's say one, OK. Get the talk.
20:53 - 20:55

Calculate all of its one grams. And then for
20:55 - 20:59

each one gram, add to the year sorted set
20:59 - 21:03

the value of one for that ngram. And then
21:03 - 21:05

do that just a bunch of times.
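
The nested loop can be sketched in plain Ruby as below. Everything outside the loop shape is a stand-in: the talks are invented hashes rather than database rows, and a nested Hash plays the role of the per-year redis sorted sets.

```ruby
# Stand-ins (assumptions): talks are plain hashes, and a nested Hash plays the
# role of the per-year Redis sorted sets.
TALKS = [
  { year: 2014, abstract: "ember is great" },
  { year: 2014, abstract: "ember and rails" },
  { year: 2013, abstract: "rails is great" }
].freeze

COUNTS = Hash.new { |hash, year| hash[year] = Hash.new(0) }

def ngrams(text, n)
  text.split.each_cons(n).map { |run| run.join(" ") }
end

def generate_ngram_data_by_year(year)
  COUNTS.delete(year) # start clean so a re-run does not double count
  TALKS.select { |talk| talk[:year] == year }.each do |talk|
    [1, 2, 3].each do |n|
      ngrams(talk[:abstract], n).each do |ngram|
        COUNTS[year][ngram] += 1 # the talk calls redis.zincrby at this point
      end
    end
  end
end

generate_ngram_data_by_year(2014)
generate_ngram_data_by_year(2014) # running twice leaves the counts unchanged
p COUNTS[2014]["ember"]          # => 2
p COUNTS[2014]["ember is great"] # => 1
```

Deleting the year's data before counting is what makes the method safe to run repeatedly.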
21:05 - 21:08

So let's see if this works.
21:08 - 21:14

Let's reload. Again to prove I'm not lying.
There's
21:14 - 21:21

nothing in redis at the moment. Oops. Gotta
do
21:21 - 21:22

talk.
21:22 - 21:29

Let's worry about those Delayed::Jobs. Perfect.
Drink break.
21:30 - 21:33

So it's going through each year now. And each
21:33 - 21:35

talk in each year, counting up all the words
21:35 - 21:39

and phrases and building our sorted sets.
And it
21:39 - 21:40

is done.
21:40 - 21:43

So let's see what we got in here now.
21:43 - 21:47

OK, cool. So we got these keys. Let's, let's
21:47 - 21:48

look into one of these. One of the nice
21:48 - 21:50

things about the sorted set is you can, of
21:50 - 21:53

course, sort by it. And so the command here
21:53 - 21:56

is zrevrange. So we can do the 2014 sorted
21:56 - 21:59

set. So this is gonna give us the top
21:59 - 22:01

ten, or actually eleven, top eleven, you know,
ngrams
22:01 - 22:04

in 2014. So let's see.
22:04 - 22:09

And we can actually add :with_scores => true.
So
22:09 - 22:12

the most common words and phrases from 2014
RailsConf
22:12 - 22:17

talk abstracts. Not very surprising. The,
to, and, a,
22:17 - 22:20

of, in, you, how. Rails. OK. Rails makes the
22:20 - 22:21

number ten.
22:21 - 22:24

So there you go.
22:24 - 22:25

Now we can also, let's just have a little
22:25 - 22:28

fun here. See what some of the sort of top
22:28 - 22:30

non-trivial ones are. Obviously you could
write some code,
22:30 - 22:33

maybe kill stop words. Stuff like that. If
you
22:33 - 22:35

don't care about them.
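
A sketch of the stop-word idea mentioned here, which is not something built in the talk. The counts and the short stop-word list are made up, and the filter only drops n-grams that exactly match a stop word, so a phrase such as "how to" would still pass through.

```ruby
# Illustrative counts and a tiny hand-picked stop-word list (both made up).
counts = { "the" => 40, "to" => 31, "rails" => 12, "and" => 29, "code" => 9, "ruby" => 7 }
STOP_WORDS = %w[the to and a of in you how].freeze

# Return the `limit` highest-scoring n-grams that are not stop words.
def top_ngrams(counts, limit, stop_words)
  counts
    .reject { |ngram, _| stop_words.include?(ngram) }
    .sort_by { |_, count| -count }
    .first(limit)
end

p top_ngrams(counts, 2, STOP_WORDS) # => [["rails", 12], ["code", 9]]
```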
22:35 - 22:40

But, so. Rails. Can code. This talk. Most
popular
22:40 - 22:45

two-word phrase. Pretty good. How to. Ruby
developers. Eh,
22:45 - 22:46

this looks pretty, pretty relevant, right.
I mean, these
22:46 - 22:51

are not words you'd be surprised to see in
22:51 - 22:53

a RailsConf talk abstract.
22:53 - 22:56

So those, you know, are the most common words.
22:56 - 22:57

So we now have this. We have this for
22:57 - 22:59

every year, by the way. So we can also
22:59 - 23:01

do something, this is the same thing for 2011.
23:01 - 23:04

Whatever. And the last piece of code we're
going
23:04 - 23:06

to write, is we need to be able to
23:06 - 23:07

query this data.
23:07 - 23:09

So, you know, the actual, sort of, website
or
23:09 - 23:12

finished product, you're gonna have to, you
know, search
23:12 - 23:13

for a term. And you're gonna have to go
23:13 - 23:16

look up in your data, you know, what, what
23:16 - 23:19

are the relevant values for that term.
23:19 - 23:21

And so, how we're gonna do this. Well, the
23:21 - 23:23

first thing we gotta remember is that we normal-
23:23 - 23:27

remember we did this normalize for ngrams
thing. So
23:27 - 23:29

we have to do that again, because what if
23:29 - 23:31

someone searches for a capitalized word or
something
23:31 - 23:33

with punctuation. We have to process it the
exact
23:33 - 23:36

same way that we processed our input. Otherwise
it
23:36 - 23:39

won't match. So let's just do that.
23:39 - 23:43

And then we have this constant ALL_YEARS.
And we're
23:43 - 23:46

gonna iterate through that with each_with_object,
with a
23:46 - 23:47

hash. Let's just build up a hash. That's probably
23:47 - 23:52

the easy way to do it. Do |year, hash|.
23:52 - 23:58

And the, the relevant redis command, again,
is zscore.
23:58 - 24:04

So we can do redis dot zscore(). We're gonna
24:04 - 24:06

look up in the hash for that year, the
24:06 - 24:08

term. And we need to put this actually in
24:08 - 24:14

the hash. And so, and then we need to
24:14 - 24:16

to_i that in case it's nil.
24:16 - 24:19

OK. So this now, what does this say? ALL_YEARS
24:19 - 24:23

is just, you know, 2007 through 2014. Go through
24:23 - 24:26

each of those years. And then build up our
24:26 - 24:28

hash so that the hash, the key of the
24:28 - 24:30

year, maps to the value of, you know, the
24:30 - 24:34

number of times that term appeared in that
year.
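
A runnable sketch of that lookup, with a frozen Hash of invented counts standing in for the redis sorted sets and a plain-Ruby approximation of the normalization helper. The shape follows the description above: normalize the search term, walk the years with `each_with_object`, and turn a missing score into zero with `to_i`.

```ruby
ALL_YEARS = (2007..2014).to_a

# Made-up counts standing in for the Redis sorted sets, keyed by year.
COUNTS_BY_YEAR = {
  2013 => { "ruby" => 4 },
  2014 => { "ruby" => 2, "ember" => 3 }
}.freeze

# Plain-Ruby approximation of the talk's normalization helper.
def normalize_for_ngrams(text)
  text.downcase.strip.gsub(/\s+/, " ").gsub(/[^\w\s]/, "")
end

def query(term)
  term = normalize_for_ngrams(term)
  ALL_YEARS.each_with_object({}) do |year, hash|
    # A missing year or term yields nil, and nil.to_i is 0.
    hash[year] = COUNTS_BY_YEAR.fetch(year, {})[term].to_i
  end
end

result = query("Ember!")
puts result[2014]
puts result[2007]
```

Searching for "Ember!" and "ember" gives the same answer because the query goes through the same normalization as the indexed text.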
24:34 - 24:38

So let's, again, see if that works. Talk dot
24:38 - 24:44

query, you know, ruby or something. Cool.
So in
24:44 - 24:47

2007 it was mentioned 52 times, 2014 22 times.
24:47 - 24:50

Whatever. We can, I guess, we said Ember originally.
24:50 - 24:54

And there you go. It was not mentioned until
24:54 - 24:58

this year. Which is also kind of telling.
24:58 - 25:02

And so this is basically, you know, all of
25:02 - 25:04

the kind of step two code you need. That's
25:04 - 25:07

sort of the ngram calculation, store the results.
And
25:07 - 25:10

again, I reiterate, like, everything we just
did, is
25:10 - 25:13

kind of trivially simple. There's no fancy
algorithms. It's
25:13 - 25:15

just counting, you know, putting stuff in
the right
25:15 - 25:17

data structure. Accessing it in sort of the
right
25:17 - 25:18

way.
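To make the "it's just counting" point concrete, here is a rough sketch of counting ngrams per year with nested hashes. The sample abstracts, the `ngrams` and `count_ngrams` helpers, and the choice of 1- and 2-word ngrams are all assumptions for illustration; the talk stores its counts in Redis, where each increment would presumably be a ZINCRBY on that year's sorted set.

```ruby
# Hypothetical sample data; the real abstracts are not part of this section.
ABSTRACTS = [
  { year: 2013, text: "Ruby on Rails testing" },
  { year: 2014, text: "Ember and Rails. Rails services!" }
].freeze

# Every run of n consecutive words, after lowercasing and dropping punctuation.
# Note that this simple version lets ngrams span sentence boundaries.
def ngrams(text, n)
  words = text.downcase.scan(/[a-z0-9]+/)
  words.each_cons(n).map { |group| group.join(" ") }
end

# Builds { year => { ngram => count } }.
def count_ngrams(abstracts, sizes: [1, 2])
  counts = Hash.new { |hash, year| hash[year] = Hash.new(0) }
  abstracts.each do |abstract|
    sizes.each do |n|
      ngrams(abstract[:text], n).each do |gram|
        counts[abstract[:year]][gram] += 1
      end
    end
  end
  counts
end

counts = count_ngrams(ABSTRACTS)
counts[2014]["rails"]   # => 2
counts[2013]["ruby on"] # => 1
```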
25:18 - 25:21

And I just think there's something like pretty,
you
25:21 - 25:23

know, insightful about that, that you don't
need to
25:23 - 25:26

do fancy things all the time. And that often
25:26 - 25:29

the kind of the coolest results will come
from
25:29 - 25:31

something simple.
25:31 - 25:32

And so, as I said, the last thing we're
25:32 - 25:33

gonna do here is create this nice front end
25:33 - 25:36

interface that lets us investigate the results.
You know,
25:36 - 25:38

unfortunately, we don't really have time to
get into
25:38 - 25:40

that. It is all on the GitHub. But, I
25:40 - 25:43

will tell you, I use Highcharts as a
25:43 - 25:46

nice library, front-end library that makes
it very simple
25:46 - 25:47

to get charts up and running. It's actually
not
25:47 - 25:48

that much code.
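One reason the front end stays small is that the per-year hashes are already close to what a charting library wants. The sketch below shows one way the server side could shape them into named series; the method name `chart_payload` and the `categories`/`series` layout are assumptions, not the code from the repo.

```ruby
require "json"

# Turns { term => { year => count } } into a list of named series,
# one data point per year, with years as the x-axis categories.
def chart_payload(results_by_term, years)
  series = results_by_term.map do |term, counts_by_year|
    { name: term, data: years.map { |year| counts_by_year.fetch(year, 0) } }
  end
  { categories: years, series: series }
end

# Illustrative numbers only.
results = {
  "ruby"  => { 2013 => 5, 2014 => 2 },
  "ember" => { 2014 => 3 }
}

payload = chart_payload(results, [2013, 2014])
json = JSON.generate(payload) # ready to hand to the front end
```

Missing years are filled with 0 so every series has the same length as the category list.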
25:48 - 25:50

And I've done this already. So let's start
up
25:50 - 25:54

a server. And, oops. Let's fire up the localhost.
25:54 - 25:59

And so here we are. The abstractogram is our
25:59 - 26:00

app. So what are we, what are we gonna
26:00 - 26:01

search for here?
26:01 - 26:04

Let's see. I, you, we or something. And there
26:04 - 26:05

we go. So there, there it is. The number
26:05 - 26:09

of times the word you appears in each year.
26:09 - 26:11

Looks pretty flat. So, you know, the, these
are
26:11 - 26:13

kind of constant. Anyone have any, anything
else they
26:13 - 26:16

want to search for? Let's try ember, backbone.
26:16 - 26:19

All right. Let's say, we got, Postgres I heard.
26:19 - 26:24

All right. I guess we could all say, let's
26:24 - 26:29

say SQL. No one cares about Postgres this year.
26:29 - 26:33

Service. SOA. Oh, there is sort of a rising
26:33 - 26:36

trend of service-oriented architecture.
26:36 - 26:36

Anything else?
26:36 - 26:41

TDD. That's a good one. TDD. Testing. Test-driven,
how
26:41 - 26:48

about. So there we go. I'm sorry?
26:49 - 26:54

REST. That's a tricky one though, 'cause rest
is
26:54 - 26:55

also like a real word that, you know, like,
26:55 - 26:57

the rest of the time will be something else.
26:57 - 27:04

And. Refactor. Let's see. Ooh. That's a good
one.
27:04 - 27:10

DHH. Wow. Peaked 2011, peak DHH. Let's see,
we
27:10 - 27:12

got, Heroku is a good one. On the rise.
27:12 - 27:14

I like we can just look at Ruby and
27:14 - 27:15

Rails. This is actually, I think, pretty relevant.
It's
27:15 - 27:19

like, what are people talking about? Not Rails
anymore.
27:19 - 27:20

We got to find something new to talk about.
27:20 - 27:23

You know, it's like, too many RailsConfs.
And, in
27:23 - 27:25

fact, this actually came up at the, you know,
27:25 - 27:27

there was a speaker meeting, whatever, and
everyone was
27:27 - 27:29

talking about how, you know, their talks weren't
actually
27:29 - 27:31

about Rails.
27:31 - 27:33

And, you know, maybe this is actually an insightful
27:33 - 27:36

statement, that, you know, the, the community
has obviously
27:36 - 27:38

gotten very large and there's just a ton of
27:38 - 27:38

other stuff to talk about. People have been
talking
27:38 - 27:41

about Rails for a long time. And so, you
27:41 - 27:43

know, here I am giving a talk that's not
27:43 - 27:46

really directly about Rails. But, so maybe
this is
27:46 - 27:47

like a real trend that people are just finding
27:47 - 27:49

other stuff to talk about.
27:49 - 27:53

And that is pretty cool. So I promised that
27:53 - 27:56

I would show you the repo or whatever on
27:56 - 28:00

GitHub. You can just do bit.ly slash railsconfdata.
It's
28:00 - 28:02

just the code. Everything we've looked at
today. Plus
28:02 - 28:04

some more stuff. It's actually running live
on the
28:04 - 28:07

internet at abstractogram dot herokuapp dot
com.
28:07 - 28:10

I figure the internet's probably not working,
but let's
28:10 - 28:17

see. Yup. Classic. And, you know, otherwise
that is
28:17 - 28:20

it. And thank you for listening. And I think
28:20 - 28:20

we have time for questions.

Title:: RailsConf 2014 - Demystifying Data Science: A Live Tutorial by Todd Schneider
Duration:: 28:48
