TODD SCHNEIDER: All right. We're, we're good.
Thank you.
Sorry for the delay. Classic.
Even in the future nothing works. Welcome.
I am Todd. I'm an engineer at Rap Genius.
And today's talk is going to be about data
science with a live tutorial.
And before we get into the live coding component,
I wanted to show you all a project I
built previously, which kind of serves as
the inspiration
for this talk. Sort of. So this is a
website called weddingcrunchers dot com. What
is Wedding Crunchers?
It's a place where you can track the, the
popularity of words and phrases in the New
York
Times wedding section over the past thirty-some
years.
And a lot of you might be wondering why
on earth would this be interesting or relevant
or
funny or anything, and I hope to convince
you
of that very quickly. Here is an example
wedding announcement from the New York Times.
This one's
from 1985.
If you don't know, if you don't live in
New York or read the New York Times, the wedding
section has a certain cultural cachet. It's
kind of
an honor to be listed in there and it's
got a very resume-like structure. People get
to brag
about where they went to school and what they
do.
So here is an example. You know, Diane deCordova
is marrying Michael Monro Lewis. They both
went to
Princeton. They graduated cum laude. You know,
she works
at Morgan Stanley. He works at Salomon Brothers
in
New York and they're gonna go to London. And
this should be a little familiar to a bunch
of you.
Mr. Lewis, the associate at Salomon Brothers,
is Michael Lewis.
He wrote, you know, Liar's Poker, the famous
book about
his experience there. And before, before he
was a
famous writer, he was just another New York
Times
wedding announcement person.
And so what Wedding Crunchers does is it takes
the entire corpus of New York Times wedding
announcements
back to 1981, and you can search for words
and phrases and you can see how common those
words and phrases are, you know, by year.
It's
like, this is a good one that's relevant to
people here. You know, banker and programmer.
You know,
for example, when you list so-and-so is a
banker
or is a programmer in the announcement and
you
see, over time, you know, banker used to be
way more commonly used than programmer in
these announcements.
And only just this year, in 2014, programmer
has
finally overtaken banker as, you know, the place
that the people getting married in New York,
who are part of society, come from. Another
good
one is, if you look at goldman sachs and
google. Is my internet on? Good.
So here's another good one. So Goldman Sachs,
you
know, classic New York financial institution.
Google, new kid
on the block. Tech scene. Boom. Taking over.
And, you know, this is obviously fun, and
it's
amusing. But it's also actually pretty insightful
for a
relatively simple concept. I mean, this one
graph tells
a pretty powerful story of, you know, New
York
the, the finance capital of the world. Meanwhile,
we
have this sort of emerging tech scene. You
know,
Google may be the biggest player in the kind
of new tech world.
And now, when you turn to the society pages
to see who's getting married, you know, there's
more
employees from Google than there are from
Goldman Sachs.
And that's, you know, a kind of interesting
thing about the world.
And so what we're gonna do today is build
something just like Wedding Crunchers, except,
instead of using
the text of wedding announcements to analyze,
we're going
to look at all of the RailsConf talk abstracts.
And so, you know, hopefully this is, this
is
interesting to people here and, I always say,
you
know, if there's only one thing you take from
this talk, really, what it should be is that,
you know, work on a problem that's interesting
to
you. Because, especially when you're dealing
with data science,
a lot of it's pretty messy and then you
have to go through scraping stuff as we'll
get
into, and it's easy to get frustrated and
kind
of lost and like, if you're not working on
something that you care about, and something
that you
really want to know, kind of, the final result,
it's just much easier to get distracted and
kind
of, ultimately, bail.
So, again, if you take one thing, just work
on something that is interesting to you. So
the
particular kind of analysis we're gonna do
is something
called n-gram analysis. And I have a little
example
set up here. So what is an n-gram? You
may have heard the word before.
Really, all it means is, you know, consecutive
words within a sentence. So like, the example's
very simple: this talk is
boring. What are the, what are the one grams
in this sentence? It's just the words. This,
talk,
is, and boring. The two grams are every pair
of consecutive words. This talk, talk is,
is boring,
and so on.
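As a minimal sketch of the same idea in plain Ruby (using each_cons, which we'll come back to later):

```ruby
# Minimal n-gram sketch for the example sentence above.
words = "this talk is boring".split
words                                          # 1-grams: ["this", "talk", "is", "boring"]
words.each_cons(2).map { |pair| pair.join(" ") }
# => ["this talk", "talk is", "is boring"]     # 2-grams
```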
And so what we need to be able to
do in order to build, you know, a graph
like this, is we need to take a term
that's, you know, relevant to RailsConf, say
something like
Ember or whatever, and we need to be able
to look up, you know, for each year how
many times does this, you know, word or n-gram
appear in the data.
And so that is what we are going to
build. And I have this brief little outline
here.
There's kind of three steps. And this is pretty
general to, to any data project. You know,
step
one is gonna be just gathering the data, getting
it in some usable form. Step two is gonna
be kind of the analysis part where we do
the n-gram calculation. We store the results.
And then
step three is gonna be to create a nice
little front-end interface that lets us investigate,
visualize and
see what we've done.
Now unfortunately, you know, in a, in a thirty
minute talk we can't possibly do all of this.
So we're gonna focus more on items one and
two and less so on three, and even then
it's too much. So, you know, I sort of
used the analogy, it'll be a bit like watching
TV on the Food Network, where we might, you
know, throw something in the oven, mysteriously
something else
pops out of the other oven even though it's,
where did that come from?
But not to worry. Everything is also on GitHub.
There's a repo I'll share with you at the
end. So anything that we don't cover or that
we cover too quickly or something, you'll
be able
to see sort of the, the full version on
GitHub.
So let us jump in now to step one,
which is, you know, gathering the data. And
so
let's take a look back at the, the RailsConf
website again. So we have to figure out how
we're gonna model a, a RailsConf talk in our
database. So like, what, you know, attributes
does a,
do a, excuse me, does a RailsConf talk have.
And it's like, one thing we see is they
all have titles. So that looks like something.
They
have speakers. You know, there's this thing,
which is
the abstract, and then there's the bio. And
that's
probably it. That's probably all we need.
So that's pretty simple. And, you know, I
have
the little migration, which I've already run here.
But here
are attributes for talks. It's just the year,
you
know, what, what conference were we actually
at. The
title of the talk, the speaker, the abstract,
and
the bio.
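For reference, that migration looks roughly like this (a sketch of what was run before the demo; the exact column types are my assumption):

```ruby
# Sketch of the talks migration; column types are assumptions.
class CreateTalks < ActiveRecord::Migration
  def change
    create_table :talks do |t|
      t.integer :year      # which RailsConf the talk was given at
      t.string  :title
      t.string  :speaker
      t.text    :abstract
      t.text    :bio
      t.timestamps
    end
  end
end
```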
And so also, that's, again, pretty straightforward.
The gemfile
is also very simple. It's mostly pretty boilerplate.
Rails 4, Ruby 2.1. The only gems I wanted
to call out here are, we're gonna use nokogiri
for, you know, fetching, or, parsing websites
and kind
of scraping the data we need. We're gonna
use
Postgres as our main data store and we're gonna
use redis to build this sort of index that
we can ultimately use to look up, you know,
how common a word is.
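The relevant lines of that Gemfile look roughly like this (a sketch; version numbers are approximate):

```ruby
# Sketch of the relevant Gemfile entries; versions are approximate.
source 'https://rubygems.org'

ruby '2.1.1'

gem 'rails', '~> 4.1'
gem 'pg'        # Postgres as the main data store
gem 'nokogiri'  # fetching and parsing the conference pages
gem 'redis'     # sorted sets for the n-gram indexes
```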
And so one thing that's not here is, like,
you know, gem fancy data algorithm. And a
lot
of people, this is kind of where Ruby often
gets a bad reputation of, you know, not being
supportive of scientific computing or whatever.
And other languages
have more, more support. But my claim is that
it's really not that important. You can get
a
ton of mileage out of very simple tools that
you can build yourself.
You know, you don't need a fancy gem or
any fancy algorithm. Those things are cool
too and
they have their place. But they're not needed
a
lot of the time. And, you know, Ruby is
a wonderful language for, especially, scraping
stuff from the
web. There's a ton of support there. And so
I don't think that the, the lack of, you
know, fancy algorithm gems should necessarily
be a deterrent
at all.
And so hopefully part of this talk is convincing
people that Ruby and Rails are actually quite
well-suited
to problems like this.
OK. So now we actually need to write some
code to scrape the talks. And you know, if
you've ever done anything like this before,
you know
that Chrome Inspector is your best friend.
So let's
fire that up. We're gonna inspect element,
and so
like, we actually, what we need to do now
is take you know, this HTML on the page
and turn it into a database record that we
can then, you know, use to our advantage later.
And so it looks like, you know, all the
talks are in these session classes. So that's
something.
We can look in here. This looks like something.
So let's make this bigger.
And you know it helps to, well, it's kind
of essential to be decent with CSS selectors
here,
because that's how we're going to basically
find stuff.
So let's see, OK, so there's eighty-one session
divs.
That sounds about right. I happen to know
that
mine is number seventy-eight, so let's, let's
look at
that. And so here we are. So we need
to, again, the attributes we're storing are
the title, the
speaker, the
abstract, and the bio. And so we're gonna
need
to pull these things out.
So let's see. It looks like the, the title
is in this h1 element inside the header. So
let's just make sure that works. You know,
header
h1. That looks right.
The, the speaker looks to be the header h2.
Cool.
Now the abstract is in this p tag, so
we can do something like this. But this is
actually not quite right. So what's wrong
with this?
Well, the abstract ends, you know, suited
to the
problem. The bio here is also in a p
tag: 'Originally a math guy.' And we've actually
pulled
all the p-tags. So we need a way of
not doing that. And this is where you just
need to know a little bit of CSS. Not
very complicated. But if you use the little
greater
than guy, what this says is only take the
p tags that are immediate descendants of the
session
div. And so now we have, you know, only
the abstract.
And lastly, you know, the bio is just in
its own little section. So something like
that. Cool.
So that is the jQuery version of it. We
need to do this, though, in Ruby. And as
I said, this does sometimes get a little tedious.
But let's, let's write the code. So I have
this empty method - create_railsconf_2014_talks.
And also this method
I've written already called fetch_and_parse,
which just gets a
URL and sends it to nokogiri, which we can
then use to do our CSS selectors.
So let, let's just write this. So we can
say doc is fetch_and_parse. The url is this.
Let's
see if this works in the console.
Of course, in here. Do I have internet? Nice.
So we can then check the same thing. Again.
Looks right. Let's find my talk, which, this
part
I couldn't possibly tell you. When you use
the
nokogiri, the eq thing, you have to add two
to whatever jQuery gives you. So I'm number 80
now.
Don't ask me why. I couldn't possibly tell
you.
But maybe someone here knows. Be curious to
find
out.
AUDIENCE: ?? (00:11:13)
T.S.: So there it is. There's the title. So
let us now write some code here. We have
our, our document. We're gonna go through
each session.
The CSS method is kind of like, you know,
the selector for nokogiri. Each element. So
for each of
these we're gonna create a talk.
And again. So the year we already know is
2014. The title we're gonna say is, elm.css("header
h1").inner_text.
Speaker, header h2, dun nuh nuh dun nuh nuh
nuh. Gettin' there.
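Put together, the method looks roughly like this (a sketch: the actual URL is elided, implementing fetch_and_parse with open-uri is an assumption, and the bio selector is a guess at the page's markup):

```ruby
require 'open-uri'

RAILSCONF_2014_URL = '...' # actual URL elided here

# Helper mentioned above: fetch a URL and hand it to Nokogiri.
# Implementing it with open-uri is an assumption.
def self.fetch_and_parse(url)
  Nokogiri::HTML(open(url))
end

def self.create_railsconf_2014_talks
  doc = fetch_and_parse(RAILSCONF_2014_URL)

  doc.css('.session').each do |elm|
    create!(
      year:     2014,
      title:    elm.css('header h1').inner_text,
      speaker:  elm.css('header h2').inner_text,
      abstract: elm.css('> p').inner_text,  # immediate children only, per the selector above
      bio:      elm.css('.bio').inner_text  # assumption: the bio sits in its own section
    )
  end
end
```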
All right. So I think this will probably work.
Let's find out. And so we're back in here.
Just to prove to you that I'm not lying,
2014 dot count. There's none of them. And,
what'd
I call this method? This guy. Delayed::Job.
All right. So we just did something. Did it
work? Nice. We got eighty-one talks. Most
importantly, let's,
we have my talk. That's the, that's the only
one that matters anyway. And so, you know,
you
might be thinking now, like, you know, what
the
heck, I came to the, the data science talk,
not the scraping talk. You know, to that,
I
would say, tough luck. They're the same thing.
You
know, you might not, you might not want to
hear it, but guess what, this is usually the
most important part of the entire project.
It's the hardest part, you know, because guess
what,
just because we got the 2014 talks, you know,
now we have to get the 2013 talks. And
the 2012 talks. And they're all on different
websites.
They all have different structures. You know,
you're gonna
have to write different code to get each type
of website. It's a pain. And this is why
I said earlier, you know, really make sure
you're
working on something you care about. Because
it's just
not fun to like, like, ugh, in 2008 they
separated the speakers and the abstracts.
And it's like,
it's just, it's annoying, but again, it's
the most
important part I would say.
You know, so much of data science is taking
data that's either unstructured or structured
in the wrong
format to you and, you know, getting it into
the way, you know, into the structure that
you
need to do whatever analysis you want to do.
So in this case, that's taking, you know,
HTML
on a page and converting it into a Postgres
database.
And so we have done that now. And again,
take my word that, you know, I've done this
for the other years as well, back to 2007,
and so we have a total of 497 talks
in here from RailsConfs over the years. And
so
that's cool. That's basically our dataset
that we're gonna
use.
And so we can sort of move on to,
you know, step two of the project here, which
is, you know, do the n-gram calculation and
store
the results. And so let's go back to talk.rb.
All this by the way is just in, you
know, app/models/talk.rb. That's where all
this code is.
And I have another empty method somewhere
called def
ngrams. And so this method goes on a talk.
So given a value of n, calculate the ngrams
from that talk's abstract.
And so, what are we gonna do here? So
again, let's look at, talk dot mine. Dot abstract.
So here's the abstract, and we need to, you
know, get ngrams out of this. And so the
first thing, I've written a little helper
method over
here, which I've just tacked onto String, called
normalized_for_ngrams. And you know, what
does this do? Well,
it downcases it, cause we're gonna do case
insensitive.
There might be cases where you want to keep
case sensitivity. Whatever. Doesn't really
matter. In this case
we're gonna go case insensitive.
Squish is a nice, convenient method that will
kind
of standardize the white space for you. So
like,
if there's any trailing or leading white space,
and
if there's like a bunch of middle white space,
this will, it'll kill the beginning and ending
and
it'll turn anything in the middle into a single
space.
So that way you just don't have to worry
about things like double spaces or, you know,
other,
other weird things that can happen. Cause
of course
it's the web. Whatever can go wrong will go
wrong. So make sure that your data's in
some
kind of standardized format.
And the last thing I've done is removed punctuation.
And the reason for that is just cause like,
you know, there's commas, periods, colons,
all sorts of
stuff like that. We don't really care about
it.
And so let's just kill any character that's
not
either a space or a word character. This is
kind of the, little like, Ruby special regex
thing.
So we're gonna kill punctuation.
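Here's roughly what that helper looks like, tacked onto String (a sketch; squish comes from ActiveSupport, and the exact regex is an approximation):

```ruby
# Sketch of the helper: downcase, standardize whitespace, strip punctuation.
# squish is provided by ActiveSupport; the regex is an approximation.
class String
  def normalized_for_ngrams
    downcase.squish.gsub(/[^\w ]/, '')
  end
end

"This  talk, is BORING!".normalized_for_ngrams
# => "this talk is boring"
```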
And so we can actually just mess with this
in the console maybe. So let's take our little
example sentence. You know, this talk is boring.
And
let's normalize that for ngrams. OK. All it
did
was downcase it. And now we want to get
that into an array of words, which we can
just do with split. Cool.
And now there's actually this neat little
Ruby enumerable
thing, which I didn't know about until pretty
recently.
each_cons, which stands for each consecutive.
And it
takes an argument, a single number, like two,
and
what this says is give me all of the,
you know, consecutive pairs of two. So if
we
to_a this, now we have this array of arrays,
which looks like exactly what we want.
This talk, talk is, and is boring. And so
the last thing we can do there is we
can just map that array to make these just
phrases.
So cool. So this is actually the entirety
of
our ngrams method, is just, you know, this
code
right here. So let's copy and paste this into
the old method here. So we want. We're doing
this on the abstract. Let's get some new lines
here.
All right, cool. So again, just to recap,
you
take the abstract, we normalize it, which
means, you
know, downcase and kill the punctuation. We
split it
to words. Uh, wait. Actually this should not
be
two. That should be n. And then we join
those. So let's, let's see if this worked.
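Pieced together on the Talk model, the method is roughly this (a sketch):

```ruby
# Sketch of the ngrams instance method on the Talk model.
def ngrams(n)
  abstract.normalized_for_ngrams
          .split
          .each_cons(n)
          .map { |words| words.join(' ') }
end
```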
So talk dot mine again. And one. OK. So
here are all the one grams, which is just
the sequence of words. And that looks correct.
And
all of the two grams. Also looks correct,
I
think. Yeah, OK, perfect.
And so this is kind of the, the method
we're gonna use to decompose these talks into
just,
you know, an array of words and phrases. And
so what is the next step, now that we
have this method? Well, the next step is we
have to build these indexes that we're actually
gonna
use to look up, you know, the final results.
And so for that, we're gonna use redis.
Now, we don't have sort of enough time to
really get totally into the details of redis.
But,
you know, the, the thing that we're really
gonna
use is the, the sorted set data structure,
which
I'd definitely encourage you to check out.
It's a
great data structure. Great feature of redis.
And so
what is a sorted set?
Well, it's got the word set in it, so
that tells you something. It's, you know,
unique elements.
And the, the neat feature of a sorted set
is that each element in the set also has
a score associated with it. So the way we
can use this is, remember, again, the question
I'm
gonna answer is, like, you know, if someone
searches
for Ember, you know, how many times was Ember
mentioned in 2007. How many times was it mentioned
in 2008. How many times was it mentioned in
2009?
So we're gonna have one sorted set for each
year, where the members of the sorted set
are
all the words and phrases that appeared in
RailsConf
talks, and the scores are the number of times
that those ngrams appeared.
And then, you know, redis is very efficient
about
this zscore method. You can look up. It's
like
this command right here would say, OK, in
the
sorted set for 2014, get me the score associated
with the member ember. And that's gonna tell
you,
you know, some number. Like, three or whatever.
Is
the number of times it gets mentioned.
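In redis-rb terms, that lookup is roughly this (a sketch; the key is just the year as a string):

```ruby
require 'redis'

redis = Redis.new

# "In the 2014 sorted set, get me the score for the member 'ember'."
redis.zscore('2014', 'ember')   # returns a score, or nil if the member isn't in the set
```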
So what we have to do is build these
sorted sets. One for each year again. And
again
I have an empty method called generate_ngram_data_by_year.
So iterate
through all talks from a given year, you know,
calculate the ngram counts and add it to the
appropriate redis sorted set. So let's write
that.
So one thing we need to do is make
sure we're not double counting. So if we have
an old sorted set sitting around, let's delete
it.
So let's do redis.del(year). We need to decide
what
values of n we're gonna use. So let's just
say one, two, and three, meaning we're gonna
calculate
all the one grams, two grams, three grams.
Anything
longer than that and it's sort of, like, what's
even the point. You're getting into pretty
specific sentences.
There's not gonna be a lot of repetition.
So now we need to iterate through each talk
for the given year: where(:year => year).find_each.
And then
for each talk we need to iterate through each
value of n. And then for each value of
n, what do we need to do? We need
to calculate the ngram, so do talk dot ngrams.
This is the method we just wrote. We're gonna
pass it n.
Do |ngram|.
And then finally, we're going to add this
to
the relevant redis sorted set. So the command
for
that is redis.zincrby.
And this goes, you give it a year, you
give it a number, like one, and you give
it the thing you're incrementing.
OK. So let's look at this method now. We're
gonna take, give it a year. We're gonna go
through every talk from that year. We're gonna
go
through values of n, which is one, two and
three, so let's say one, OK. Get the talk.
Calculate all of its one grams. And then for
each one gram, add to the year sorted set
the value of one for that ngram. And then
do that just a bunch of times.
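End to end, the method looks roughly like this (a sketch; REDIS stands in for however the Redis connection is configured in the app):

```ruby
# Sketch of the class method on Talk; REDIS is assumed to be a configured client.
def self.generate_ngram_data_by_year(year)
  REDIS.del(year)                      # clear any old sorted set so we don't double count

  where(:year => year).find_each do |talk|
    (1..3).each do |n|                 # 1-grams, 2-grams, and 3-grams
      talk.ngrams(n).each do |ngram|
        REDIS.zincrby(year, 1, ngram)  # bump this n-gram's count in the year's sorted set
      end
    end
  end
end
```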
So let's see if this works.
Let's reload. Again to prove I'm not lying.
There's
nothing in redis at the moment. Oops. Gotta
do
talk.
Let's not worry about those Delayed::Jobs. Perfect.
Drink break.
So it's going through each year now. And each
talk in each year, counting up all the words
and phrases and building our sorted sets.
And it
is done.
So let's see what we got in here now.
OK, cool. So we got these keys. Let's, let's
look into one of these. One of the nice
things about the sorted set is you can, of
course, sort by it. And so the command here
is zrevrange. So we can do the 2014 sorted
set. So this is gonna give us the top
ten, or actually eleven, top eleven, you know,
ngrams
in 2014. So let's see.
And we can actually add with_scores: true.
So
the most common words and phrases from 2014
RailsConf
talk abstracts. Not very surprising. The,
to, and, a,
of, in, you, how. Rails. OK. Rails makes the
number ten.
So there you go.
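For reference, that console check with redis-rb is roughly (a sketch, using the redis client from above):

```ruby
# Top eleven n-grams for 2014, with their counts.
redis.zrevrange('2014', 0, 10, with_scores: true)
# => an array of [ngram, score] pairs, highest score first
```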
Now we can also, let's just have a little
fun here. See what some of the sort top
non-trivial ones are. Obviously you could
write some code,
maybe kill stop words. Stuff like that. If
you
don't care about them.
But, so. Rails. Can code. This talk. Most
popular
two-word phrase. Pretty good. How to. Ruby
developers. Eh,
this looks pretty, pretty relevant, right.
I mean, these
are not words you'd be surprised to see in
a RailsConf talk abstract.
So those, you know, are the most common words.
So we now have this. We have this for
every year, by the way. So we can also
do something, this is the same thing for 2011.
Whatever. And the last piece of code we're
going
to write, is we need to be able to
query this data.
So, you know, the actual, sort of, website
or
finished product, you're gonna have to, you
know, search
for a term. And you're gonna have to go
look up in your data, you know, what, what
are the relevant values for that term.
And so, how we're gonna do this. Well, the
first thing we gotta remember is that we did
this normalize-for-ngrams
thing. So
we have to do that again, because what if
someone searches for a capitalized word or
with something
with punctuation. We have to process it the
exact
same way that we processed our input. Otherwise
it
won't match. So let's just do that.
And then we have this constant ALL_YEARS.
And we're
gonna iterate through that with an object
with a
hash. Let's just build up a hash. That's probably
the easy way to do it. Do |year, hash|.
And the, the relevant redis command, again,
is zscore.
So we can do redis dot zscore(). We're gonna
look up, in the sorted set for that year,
the term. And we need to put this actually
in the hash. And then we need to
to_i that in case it's nil.
OK. So this now, what does this say? ALL_YEARS
is just, you know, 2007 through 2014. Go through
each of those years. And then build up our
hash so that the hash, the key of the
year, maps to the value of, you know, the
number of times that term appeared in that
year.
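Put together, the lookup method is roughly this (a sketch; ALL_YEARS and the REDIS client are as described above):

```ruby
# Sketch of the query class method on Talk.
def self.query(term)
  normalized = term.normalized_for_ngrams   # process the search term the same way as the input

  ALL_YEARS.each_with_object({}) do |year, hash|
    hash[year] = REDIS.zscore(year, normalized).to_i   # nil.to_i => 0 when never mentioned
  end
end
```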
So let's, again, see if that works. Talk dot
query, you know, ruby or something. Cool.
So in
2007 it was mentioned 52 times, 2014 22 times.
Whatever. We can, I guess, we said Ember originally.
And there you go. It was not mentioned until
this year. Which is also kind of telling.
And so this is basically, you know, all of
the kind of step two code you need. That's
sort of the ngram calculation, store the results.
And
again, I reiterate, like, everything we just
did, is
kind of trivially simple. There's no fancy
algorithms. It's
just counting, you know, putting stuff in
the right
data structure. Accessing it in sort of the
right
way.
And I just think there's something like pretty,
you
know, insightful about that, that you don't
need to
do fancy things all the time. And that often
the kind of the coolest results will come
from
something simple.
And so, as I said, the last thing we're
gonna do here is create this nice front end
interface that lets us investigate the results.
You know,
unfortunately, we don't really have time to
get into
that. It is all on the GitHub. But, I
will tell you, I used Highcharts, which is a
nice front-end library that makes
it very simple
to get charts up and running. It's actually
not
that much code.
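As a rough idea of the shape of the front end (hypothetical; the names and route here are illustrative, not the actual app's code), the chart just needs a small JSON endpoint that returns Talk.query's hash:

```ruby
# Hypothetical JSON endpoint for the chart to read from.
class NgramsController < ApplicationController
  def show
    # e.g. { 2007 => 52, ..., 2014 => 22 } for "ruby"
    render json: Talk.query(params[:term])
  end
end
```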
And I've done this already. So let's start
up
a server. And, oops. Let's fire up the localhost.
And so here we are. The abstractogram is our
app. So what are we, what are we gonna
search for here?
Let's see. I, you, we or something. And there
we go. So there, there it is. The number
of times the word you appears in each year.
Looks pretty flat. So, you know, the, these
are
kind of constant. Anyone have any, anything
else they
want to search for? Let's try ember, backbone.
All right. Let's say, we got, Postgres I heard.
All right. I guess we could all say, let's
say SQL. No one cares about Postgres this year.
Service. SOA. Oh, there is sort of a rising
trend of service-oriented architecture.
Anything else?
TDD. That's a good one. TDD. Testing. Test-driven,
how
about. So there we go. I'm sorry?
Rest. That's a tricky one though, cause rest
is
also like a real word that, you know, like,
the rest of the time will be something else.
And. Refactor. Let's see. Ooh. That's a good
one.
DHH. Wow. Peaked 2011, peak DHH. Let's see,
we
got, Heroku is a good one. On the rise.
I like we can just look at Ruby and
Rails. This is actually, I think, pretty relevant.
It's
like, what are people talking about? Not Rails
anymore.
We got to find something new to talk about.
You know, it's like, too many RailsConfs.
And, in
fact, this actually came up at the, you know,
there was a speaker meeting, whatever, and
everyone was
talking about how, you know, their talks weren't
actually
about Rails.
And, you know, maybe this is actually an insightful
statement, that, you know, the, the community
has obviously
gotten very large and there's just a ton of
other stuff to talk about. People have been
talking
about Rails for a long time. And so, you
know, here I am giving a talk that's not
really directly about Rails. But, so maybe
this is
like a real trend that people are just finding
other stuff to talk about.
And that is pretty cool. So I promised that
I would show you the repo or whatever on
GitHub. You can just do bit.ly slash railsconfdata.
It's
just the code. Everything we've looked at
today. Plus
some more stuff. It's actually running live
on the
internet at abstractogram dot herokuapp dot
com.
I figure the internet's probably not working,
but let's
see. Yup. Classic. And, you know, otherwise
that is
it. And thank you for listening. And I think
we have time for questions.