RailsConf 2014 - Demystifying Data Science: A Live Tutorial by Todd Schneider

  • 0:17 - 0:18
    TODD SCHNEIDER: All right. We're, we're good.
    Thank you.
  • 0:18 - 0:20
    Sorry for the delay. Classic.
  • 0:20 - 0:22
    Even in the future nothing works. Welcome.
  • 0:22 - 0:26
    I am Todd. I'm an engineer at Rap Genius.
  • 0:26 - 0:32
    And today's talk is going to be about data
    science with a live tutorial.
  • 0:32 - 0:34
    And before we get into the live coding component,
  • 0:34 - 0:36
    I wanted to show you all a project I
  • 0:36 - 0:39
    built previously, which kind of serves as
    the inspiration
  • 0:39 - 0:41
    for this talk. Sort of. So this is a
  • 0:41 - 0:45
    website called weddingcrunchers dot com. What
    is Wedding Crunchers?
  • 0:45 - 0:48
    It's a place where you can track the, the
  • 0:48 - 0:51
    popularity of words and phrases in the New
    York
  • 0:51 - 0:54
    Times wedding section over the past thirty-some
    years.
  • 0:54 - 0:56
    And a lot of you might be wondering why
  • 0:56 - 0:59
    on earth would this be interesting or relevant
    or
  • 0:59 - 1:02
    funny or anything, and I hope to convince
    you
  • 1:02 - 1:04
    of that very quickly. Here is a, a example
  • 1:04 - 1:07
    wedding announcement from the New York Times.
    This one's
  • 1:07 - 1:08
    from 1985.
  • 1:08 - 1:09
    If you don't know me, you don't live in
  • 1:09 - 1:11
    New York, read the New York Times, the wedding
  • 1:11 - 1:14
    section has a certain cultural cachet. It's
    kind of
  • 1:14 - 1:16
    an honor to be listed in there and it's
  • 1:16 - 1:19
    got a very resume-like structure. People get
    to brag
  • 1:19 - 1:20
    about where they went to school and what they
  • 1:20 - 1:21
    do.
  • 1:21 - 1:23
    So here is an example. You know, Diane deCordova
  • 1:23 - 1:25
    is marrying Michael Monroe Lewis. They both
    went to
  • 1:25 - 1:28
    Princeton. They graduated Cum Laude. You know,
    she works
  • 1:28 - 1:30
    at Morgan Stanley. He works at Salomon Brothers
    in
  • 1:30 - 1:33
    New York and they're gonna go to London. And
  • 1:33 - 1:34
    this should be a little familiar to a bunch
  • 1:34 - 1:35
    of you.
  • 1:35 - 1:38
    Mr. Lewis, an associate at Salomon Brothers,
    is Michael Lewis.
  • 1:38 - 1:41
    He's gonna write Liar's Poker, famous
    book about
  • 1:41 - 1:43
    his experience there. And before, before he
    was a
  • 1:43 - 1:46
    famous writer, he was just another New York
    Times
  • 1:46 - 1:50
    wedding announced person.
  • 1:50 - 1:52
    And so what Wedding Crunchers does is it takes
  • 1:52 - 1:55
    the entire corpus of New York Times wedding
    announcements
  • 1:55 - 1:57
    back from 1981 and you can search for words
  • 1:57 - 2:00
    and phrases and you can see how common those
  • 2:00 - 2:02
    words and phrases are, you know, by year.
    It's
  • 2:02 - 2:03
    like, this is a good one that's relevant to
  • 2:03 - 2:06
    people here. You know, banker and programmer.
    You know,
  • 2:06 - 2:09
    for example, when you list so-and-so is a
    banker
  • 2:09 - 2:12
    or is a programmer in the announcement and
    you
  • 2:12 - 2:14
    see, over time, you know, banker used to be
  • 2:14 - 2:18
    way more commonly used than programmer in
    these announcements.
  • 2:18 - 2:21
    And only just this year, in 2014, programmer
    has
  • 2:21 - 2:28
    finally overtaken banker as, you know, the,
    the place,
  • 2:28 - 2:30
    you know, the people getting married in New
    York,
  • 2:30 - 2:33
    who are part of society, come from. Another
    good
  • 2:33 - 2:35
    one is, if you look at goldman, sachs and
  • 2:35 - 2:38
    google- is my internet on? Good.
  • 2:38 - 2:41
    So here's another good one. So Goldman Sachs,
    you
  • 2:41 - 2:44
    know, classic New York financial institution.
    Google, new kid
  • 2:44 - 2:47
    on the block. Tech scene. Boom. Taking over.
  • 2:47 - 2:50
    And, you know, this is obviously fun, and
    it's
  • 2:50 - 2:52
    amusing. But it's also actually pretty insightful
    for a
  • 2:52 - 2:56
    relatively simple concept. I mean, this one
    graph tells
  • 2:56 - 2:59
    a pretty powerful story of, you know, New
    York
  • 2:59 - 3:02
    the, the finance capital of the world. Meanwhile,
    we
  • 3:02 - 3:04
    have this sort of emerging tech scene. You
    know,
  • 3:04 - 3:05
    Google may be the biggest player in the kind
  • 3:05 - 3:07
    of new tech world.
  • 3:07 - 3:10
    And now, when you turn to the society pages
  • 3:10 - 3:11
    to see who's getting married, you know, there's
    more
  • 3:11 - 3:14
    employees from Google than there are from
    Goldman Sachs.
  • 3:14 - 3:17
    And that, you know, kind of interesting thing
    in
  • 3:17 - 3:18
    the world.
  • 3:18 - 3:20
    And so what we're gonna do today is build
  • 3:20 - 3:25
    something just like Wedding Crunchers, except,
    instead of using
  • 3:25 - 3:28
    the text of wedding announcements to analyze,
    we're going
  • 3:28 - 3:33
    to look at all of the RailsConf talk abstracts.
  • 3:33 - 3:34
    And so, you know, hopefully this is, this
    is
  • 3:34 - 3:37
    interesting to people here and, I always say,
    you
  • 3:37 - 3:39
    know, if there's only one thing you take from
  • 3:39 - 3:41
    this talk, really, what it should be is that,
  • 3:41 - 3:44
    you know, work on a problem that's interesting
    to
  • 3:44 - 3:46
    you. Because, especially when you're dealing
    with data science,
  • 3:46 - 3:48
    a lot of it's pretty messy and then you
  • 3:48 - 3:49
    have to go through scraping stuff as we'll
    get
  • 3:49 - 3:52
    into, and it's easy to get frustrated and
    kind
  • 3:52 - 3:54
    of lost and like, if you're not working on
  • 3:54 - 3:55
    something that you care about, and something
    that you
  • 3:55 - 3:58
    really want to know, kind of, the final result,
  • 3:58 - 4:00
    it's just much easier to get distracted and
    kind
  • 4:00 - 4:01
    of, ultimately, bail.
  • 4:01 - 4:04
    So, again, if you take one thing, just work
  • 4:04 - 4:08
    on something that is interesting to you. So
    the
  • 4:08 - 4:10
    particular kind of analysis we're gonna do
    is something
  • 4:10 - 4:13
    called n-gram analysis. And I have a little
    example
  • 4:13 - 4:14
    set up here. So what is an n-gram? You
  • 4:14 - 4:16
    may have heard the word before.
  • 4:16 - 4:19
    Really, all it means is, you know, a, a
  • 4:19 - 4:24
    consecutive words as part of a sentence. So
    like,
  • 4:24 - 4:26
    example's very simple, for one simple. This
    talk is
  • 4:26 - 4:28
    boring. What are the, what are the one grams
  • 4:28 - 4:30
    in this sentence? It's just the words. This,
    talk,
  • 4:30 - 4:33
    is, and boring. The two grams are every pair
  • 4:33 - 4:36
    of consecutive words. This talk, talk is,
    is boring,
  • 4:36 - 4:37
    and so on.
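As a rough illustration of the definition above (a sketch, not code from the talk), the one-grams and two-grams of the example sentence can be listed by sliding a window of n words across it. The helper name `window_ngrams` is made up for this example.

```ruby
# Hypothetical helper: list every run of n consecutive words in a sentence.
def window_ngrams(sentence, n)
  words = sentence.split
  # The last window starts at index words.length - n; no windows if n is too large.
  last_start = words.length - n
  return [] if last_start < 0

  (0..last_start).map { |start| words[start, n].join(" ") }
end

sentence = "This talk is boring"
one_grams = window_ngrams(sentence, 1)
two_grams = window_ngrams(sentence, 2)

puts one_grams.inspect # ["This", "talk", "is", "boring"]
puts two_grams.inspect # ["This talk", "talk is", "is boring"]
```

A sentence of four words has four one-grams and three two-grams; in general there are (word count - n + 1) n-grams when that number is positive.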
  • 4:37 - 4:38
    And so what we need to be able to
  • 4:38 - 4:41
    do in order to build, you know, a graph
  • 4:41 - 4:43
    like this, is we need to take a term
  • 4:43 - 4:45
    that's, you know, relevant to RailsConf, say
    something like
  • 4:45 - 4:47
    Ember or whatever, and we need to be able
  • 4:47 - 4:49
    to look up, you know, for each year how
  • 4:49 - 4:51
    many times does this, you know, word or n-gram
  • 4:51 - 4:54
    appear in the data.
  • 4:54 - 4:56
    And so that is what we are going to
  • 4:56 - 4:59
    build. And I have this brief little outline
    here.
  • 4:59 - 5:01
    There's kind of three steps. And this is pretty
  • 5:01 - 5:05
    general to, to any data project. You know,
    step
  • 5:05 - 5:07
    one is gonna be just gathering the data, getting
  • 5:07 - 5:10
    it in some usable form. Step two is gonna
  • 5:10 - 5:11
    be kind of the analysis part where we do
  • 5:11 - 5:14
    the n-gram calculation. We store the results.
    And then
  • 5:14 - 5:16
    step three is gonna be to create a nice
  • 5:16 - 5:19
    little front-end interface that lets us investigate,
    visualize and
  • 5:19 - 5:21
    see what we've done.
  • 5:21 - 5:23
    Now unfortunately, you know, in a, in a thirty
  • 5:23 - 5:26
    minute talk we can't possibly do all of this.
  • 5:26 - 5:29
    So we're gonna focus more on items one and
  • 5:29 - 5:31
    two and less so on three, and even then
  • 5:31 - 5:33
    it's too much. So, you know, I sort of
  • 5:33 - 5:35
    used the analogy, it'll be a bit like watching
  • 5:35 - 5:37
    TV on the Food Network, where we might, you
  • 5:37 - 5:40
    know, throw something in the oven, mysteriously
    something else
  • 5:40 - 5:42
    pops out of the other oven even though it's,
  • 5:42 - 5:44
    where did that come from?
  • 5:44 - 5:46
    But not to worry. Everything is also on GitHub.
  • 5:46 - 5:48
    There's a repo I'll share with you at the
  • 5:48 - 5:50
    end. So anything that we don't cover or that
  • 5:50 - 5:52
    we cover too quickly or something, you'll
    be able
  • 5:52 - 5:54
    to see sort of the, the full version on
  • 5:54 - 5:56
    GitHub.
  • 5:56 - 5:58
    So let us jump in now to step one,
  • 5:58 - 6:00
    which is, you know, gathering the data. And
    so
  • 6:00 - 6:02
    let's take a look back at the, the RailsConf
  • 6:02 - 6:03
    website again. So we have to figure out how
  • 6:03 - 6:06
    we're gonna model a, a RailsConf talk in our
  • 6:06 - 6:10
    database. So like, what, you know, attributes
    does a,
  • 6:10 - 6:13
    do a, excuse me, does a RailsConf talk have.
  • 6:13 - 6:14
    And it's like, one thing we see is they
  • 6:14 - 6:18
    all have titles. So that looks like something.
    They
  • 6:18 - 6:20
    have speakers. You know, there's this thing,
    which is
  • 6:20 - 6:23
    the abstract, and then there's the bio. And
    that's
  • 6:23 - 6:25
    probably it. That's probably all we need.
  • 6:25 - 6:28
    So that's pretty simple. And, you know, I
    have
  • 6:28 - 6:30
    the little migration. I've already run here.
    But here
  • 6:30 - 6:32
    are attributes for talks. It's just the year,
    you
  • 6:32 - 6:34
    know, what, what conference were we actually
    at. The
  • 6:34 - 6:36
    title of the talk, the speaker, the abstract,
    and
  • 6:36 - 6:38
    the bio.
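A minimal plain-Ruby stand-in for the record described above, assuming nothing beyond the five attributes named in the talk. The real project uses a Rails migration and an ActiveRecord model, which are not shown here; the `Struct` and the placeholder values below are illustrative only.

```ruby
# Plain-Ruby stand-in for the talks table: one field per attribute named in the talk.
Talk = Struct.new(:year, :title, :speaker, :abstract, :bio, keyword_init: true)

# Abstract and bio values are placeholders, not scraped data.
talk = Talk.new(
  year: 2014,
  title: "Demystifying Data Science: A Live Tutorial",
  speaker: "Todd Schneider",
  abstract: "placeholder abstract text",
  bio: "placeholder bio text"
)

puts Talk.members.inspect
puts talk.year
```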
  • 6:38 - 6:41
    And so also, that's, again, pretty straightforward.
    The gemfile
  • 6:41 - 6:45
    is also very simple. It's mostly pretty
    boilerplate.
  • 6:45 - 6:48
    Rails 4, Ruby 2.1. The only gems I wanted
  • 6:48 - 6:49
    to call out here are, we're gonna use nokogiri
  • 6:49 - 6:52
    for, you know, fetching, or, parsing websites
    and kind
  • 6:52 - 6:54
    of scraping the data we need. We're gonna
    use
  • 6:54 - 6:56
    Postgres as our main data store and we're gonna
  • 6:56 - 6:58
    use redis to build this sort of index that
  • 6:58 - 7:00
    we can ultimately use to look up, you know,
  • 7:00 - 7:02
    how common a word is.
  • 7:02 - 7:05
    And so one thing that's not here is, like,
  • 7:05 - 7:09
    you know, gem fancy data algorithm. And a
    lot
  • 7:09 - 7:11
    of people, this is kind of where Ruby often
  • 7:11 - 7:13
    gets a bad reputation of, you know, not being
  • 7:13 - 7:16
    supportive of scientific computing or whatever.
    And other languages
  • 7:16 - 7:19
    have more, more support. But my claim is that
  • 7:19 - 7:21
    it's really not that important. You can get
    a
  • 7:21 - 7:24
    ton of mileage out of very simple tools that
  • 7:24 - 7:24
    you can build yourself.
  • 7:24 - 7:26
    You know, you don't need a fancy gem or
  • 7:26 - 7:28
    any fancy algorithm. Those things are cool
    too and
  • 7:28 - 7:31
    they have their place. But they're not needed
    a
  • 7:31 - 7:33
    lot of the time. And, you know, Ruby is
  • 7:33 - 7:36
    a wonderful language for, especially, scraping
    stuff from the
  • 7:36 - 7:38
    web. There's a ton of support there. And so
  • 7:38 - 7:41
    I don't think that the, the lack of, you
  • 7:41 - 7:44
    know, fancy algorithm gems should necessarily
    be a deterrent
  • 7:44 - 7:44
    at all.
  • 7:44 - 7:47
    And so hopefully part of this talk is convincing
  • 7:47 - 7:50
    people that Ruby and Rails are actually quite
    well-suited
  • 7:50 - 7:51
    to problems like this.
  • 7:51 - 7:54
    OK. So now we actually need to write some
  • 7:54 - 7:56
    code to scrape the talk. And you know, if
  • 7:56 - 7:57
    you've ever done anything like this before,
    you know
  • 7:57 - 8:00
    that Chrome Inspector is your best friend.
    So let's
  • 8:00 - 8:02
    fire that up. We're gonna inspect element,
    and so
  • 8:02 - 8:04
    like, we actually, what we need to do now
  • 8:04 - 8:07
    is take you know, this HTML on the page
  • 8:07 - 8:09
    and turn it into a database record that we
  • 8:09 - 8:12
    can then, you know, use to our advantage later.
  • 8:12 - 8:13
    And so it looks like, you know, all the
  • 8:13 - 8:17
    talks are in these session classes. So that's
    something.
  • 8:17 - 8:20
    We can look in here. This looks like something.
  • 8:20 - 8:23
    So let's make this bigger.
  • 8:23 - 8:25
    And you know it helps to, well, it's kind
  • 8:25 - 8:29
    of essential to be decent with CSS selectors
    here,
  • 8:29 - 8:32
    because that's how we're going to basically
    find stuff.
  • 8:32 - 8:35
    So let's see, OK, so there's eighty-one session
    divs.
  • 8:35 - 8:38
    That sounds about right. I happen to know
    that
  • 8:38 - 8:42
    mine is number seventy-eight, so let's, let's
    look at
  • 8:42 - 8:44
    that. And so here we are. So we need
  • 8:44 - 8:47
    to, again, the, the things we're mod- or,
    the
  • 8:47 - 8:50
    attributes we're storing are the title, the
    speaker, the
  • 8:50 - 8:53
    abstract, and the bio. And so we're gonna
    need
  • 8:53 - 8:55
    to pull these things out.
  • 8:55 - 8:58
    So let's see. It looks like the, the title
  • 8:58 - 9:00
    is in this h1 element inside the header. So
  • 9:00 - 9:05
    let's just make sure that works. You know,
    header
  • 9:05 - 9:08
    h1. That looks right.
  • 9:08 - 9:14
    The, the speaker looks to be the header h2.
  • 9:14 - 9:16
    Cool.
  • 9:16 - 9:21
    Now the abstract is in this p tag, so
  • 9:21 - 9:23
    we can do something like this. But this is
  • 9:23 - 9:26
    actually not quite right. So what's wrong
    with this?
  • 9:26 - 9:30
    Well, the abstract ends, you know, suited
    to the
  • 9:30 - 9:32
    problem. The bio here is also in the p
  • 9:32 - 9:35
    tag. Originally a math guy. And we've actually
    pulled
  • 9:35 - 9:37
    all the p-tags. So we need a way of
  • 9:37 - 9:39
    not doing that. And this is where you just
  • 9:39 - 9:40
    need to know a little bit of CSS. Not
  • 9:40 - 9:43
    very complicated. But if you use the little
    greater
  • 9:43 - 9:45
    than guy, what this says is only take the
  • 9:45 - 9:47
    p tags that are immediate descendants of the
    session
  • 9:47 - 9:50
    div. And so now we have, you know, only
  • 9:50 - 9:51
    the abstract.
  • 9:51 - 9:54
    And lastly, you know, the bio is just in
  • 9:54 - 9:58
    its own little section. So something like
    that. Cool.
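The difference between the two paragraph selectors can be sketched without the live page. This is an assumption-heavy stand-in: the markup is invented to resemble one session block, and it uses REXML (distributed with Ruby) and XPath instead of Nokogiri and CSS. Here `.//p` plays the role of the descendant selector (`.session p`) and a bare `p` step plays the role of the child selector (`.session > p`).

```ruby
require "rexml/document"

# Invented markup imitating one session block; the real page's HTML is not shown in the talk.
SESSION_XML = <<~XML
  <div class="session">
    <header>
      <h1>Example Title</h1>
      <h2>Example Speaker</h2>
    </header>
    <p>Abstract text.</p>
    <section class="bio">
      <p>Bio text.</p>
    </section>
  </div>
XML

session = REXML::Document.new(SESSION_XML).root

# Descendant axis: every <p> anywhere under the session, so the bio leaks in.
all_paragraphs = REXML::XPath.match(session, ".//p").map(&:text)

# Child axis: only <p> elements directly under the session div.
direct_paragraphs = REXML::XPath.match(session, "p").map(&:text)

puts all_paragraphs.length
puts direct_paragraphs.inspect
```

With Nokogiri, the same distinction would be written with CSS selectors as in the talk; only the selection idea carries over from this sketch.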
  • 9:58 - 10:00
    So that is the jQuery version of it. We
  • 10:00 - 10:03
    need to do this, though, in Ruby. And as
  • 10:03 - 10:05
    I said, this does sometimes get a little tedious.
  • 10:05 - 10:07
    But let's, let's write the code. So I have
  • 10:07 - 10:12
    this empty method - create_railsconf_2014_talks.
    And also this method
  • 10:12 - 10:15
    I've written already called fetch_and_parse,
    which just gets a
  • 10:15 - 10:17
    URL and sends it to nokogiri, which we can
  • 10:17 - 10:18
    then use to do our CSS selectors.
  • 10:18 - 10:21
    So let, let's just write this. So we can
  • 10:21 - 10:27
    say doc is fetch_and_parse. The url is this.
    Let's
  • 10:27 - 10:34
    see if this works in the console.
  • 10:34 - 10:41
    Of course, in here. Do I have internet? Nice.
  • 10:47 - 10:53
    So we can then check the same thing. Again.
  • 10:53 - 10:58
    Looks right. Let's find my talk, which, this
    part
  • 10:58 - 10:59
    I couldn't possibly tell you. When you use
    the
  • 10:59 - 11:02
    nokogiri, the eq thing, you have to add two
  • 11:02 - 11:04
    from whatever jQuery does. So I'm number 80
    now.
  • 11:04 - 11:07
    Don't ask me why. I couldn't possibly tell
    you.
  • 11:07 - 11:10
    But maybe someone here knows. Be curious to
    find
  • 11:10 - 11:11
    out.
  • 11:11 - 11:12
    AUDIENCE: ?? (00:11:13)
  • 11:12 - 11:15
    T.S.: So there it is. There's the title. So
  • 11:15 - 11:17
    let us now write some code here. We have
  • 11:17 - 11:22
    our, our document. We're gonna go through
    each session.
  • 11:22 - 11:24
    The CSS method is kind of like, you know,
  • 11:24 - 11:29
    the selector for nokogiri. Each elements.
    So each of
  • 11:29 - 11:35
    these we're gonna create a talk.
  • 11:35 - 11:38
    And again. So the year we already know is
  • 11:38 - 11:45
    2014. The title we're gonna say is, elm.css("header
    h1").inner_text.
  • 11:48 - 11:55
    Speaker, header h2, dun nuh nuh dun nuh nuh
  • 12:00 - 12:05
    nuh. Gettin' there.
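The loop being typed here can be approximated offline. In this sketch the page markup is invented, REXML and XPath (distributed with Ruby) stand in for Nokogiri and its `css` method, and plain hashes stand in for the database records, so every name and value below is an assumption made for illustration.

```ruby
require "rexml/document"

# Invented markup with two session blocks; the real RailsConf page is not reproduced here.
PAGE_XML = <<~XML
  <div id="sessions">
    <div class="session">
      <header><h1>First Talk</h1><h2>Speaker One</h2></header>
      <p>First abstract.</p>
      <section><p>First bio.</p></section>
    </div>
    <div class="session">
      <header><h1>Second Talk</h1><h2>Speaker Two</h2></header>
      <p>Second abstract.</p>
      <section><p>Second bio.</p></section>
    </div>
  </div>
XML

doc = REXML::Document.new(PAGE_XML)

# One hash per session, with the same five attributes the talk stores for each record.
talks = REXML::XPath.match(doc, "//div[@class='session']").map do |elm|
  {
    year: 2014,
    title: REXML::XPath.first(elm, "header/h1").text,
    speaker: REXML::XPath.first(elm, "header/h2").text,
    abstract: REXML::XPath.first(elm, "p").text,
    bio: REXML::XPath.first(elm, "section/p").text
  }
end

talks.each { |attrs| puts attrs.inspect }
```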
  • 12:05 - 12:10
    All right. So I think this will probably work.
  • 12:10 - 12:14
    Let's find out. And so we're back in here.
  • 12:14 - 12:19
    Just to prove to you that I'm not lying,
  • 12:19 - 12:23
    2014 dot count. There's none of them. And,
    what'd
  • 12:23 - 12:26
    I call this method? This guy. Delayed::Job.
  • 12:26 - 12:33
    All right. So we just did something. Did it
  • 12:33 - 12:40
    work? Nice. We got eighty-one talks. Most
    importantly, let's,
  • 12:41 - 12:42
    we have my talk. That's the, that's the only
  • 12:42 - 12:47
    one that matters anyway. And so, you know,
    you
  • 12:47 - 12:48
    might be thinking now, like, you know, what
    the
  • 12:48 - 12:50
    heck, I came to the, the data science talk,
  • 12:50 - 12:52
    not the scraping talk. You know, to that,
    I
  • 12:52 - 12:56
    would say, tough luck. They're the same thing.
    You
  • 12:56 - 12:58
    know, you might not, you might not want to
  • 12:58 - 13:00
    hear it, but guess what, this is usually the
  • 13:00 - 13:02
    most important part of the entire project.
  • 13:02 - 13:05
    It's the hardest part, you know, because guess
    what,
  • 13:05 - 13:07
    just because we got the 2014 talks, you know,
  • 13:07 - 13:09
    now we have to get the 2013 talks. And
  • 13:09 - 13:11
    the 2012 talks. And they're all on different
    websites.
  • 13:11 - 13:13
    They all have different structures. You know,
    you're gonna
  • 13:13 - 13:15
    have to write different code to get each type
  • 13:15 - 13:17
    of website. It's a pain. And this is why
  • 13:17 - 13:19
    I said earlier, you know, really make sure
    you're
  • 13:19 - 13:21
    working on something you care about. Because
    it's just
  • 13:21 - 13:24
    not fun to like, like, ugh, in 2008 they
  • 13:24 - 13:27
    separated the speakers and the abstracts.
    And it's like,
  • 13:27 - 13:29
    it's just, it's annoying, but again, it's
    the most
  • 13:29 - 13:30
    important part I would say.
  • 13:30 - 13:33
    You know, so much of data science is taking
  • 13:33 - 13:36
    data that's either unstructured or structured
    in the wrong
  • 13:36 - 13:39
    format to you and, you know, getting it into
  • 13:39 - 13:41
    the way, you know, into the structure that
    you
  • 13:41 - 13:43
    need to do whatever analysis you want to do.
  • 13:43 - 13:45
    So in this case, that's taking, you know,
    html
  • 13:45 - 13:48
    on a page and converting it into a Postgres
  • 13:48 - 13:49
    database.
  • 13:49 - 13:53
    And so we have done that now. And again,
  • 13:53 - 13:54
    take my word that, you know, I've done this
  • 13:54 - 13:57
    for the other years as well. Back in 2007
  • 13:57 - 14:01
    and so we have a total of 497 talks
  • 14:01 - 14:04
    in here from RailsConfs over the years. And
    so
  • 14:04 - 14:07
    that's cool. That's basically our dataset
    that we're gonna
  • 14:07 - 14:07
    use.
  • 14:07 - 14:09
    And so we can sort of move on to,
  • 14:09 - 14:11
    you know, step two of the project here, which
  • 14:11 - 14:14
    is, you know, do the n-gram calculation and
    store
  • 14:14 - 14:17
    the results. And so let's go back to talk.rb.
  • 14:17 - 14:19
    All this by the way is just in, you
  • 14:19 - 14:22
    know, app/models/talk.rb. That's where all
    this code is.
  • 14:22 - 14:26
    And I have another empty method somewhere
    called def
  • 14:26 - 14:28
    ngrams. And so this method, we're gonna need
    to
  • 14:28 - 14:30
    give, you know, it goes on a talk. So
  • 14:30 - 14:32
    given a value of n, calculate all the ngrams
  • 14:32 - 14:35
    from that talk's abstract.
  • 14:35 - 14:36
    And so, what are we gonna do here? So
  • 14:36 - 14:43
    again, let's look at, talk dot mine. Dot abstract.
  • 14:44 - 14:45
    So here's the abstract, and we need to, you
  • 14:45 - 14:49
    know, get ngrams out of this. And so the
  • 14:49 - 14:51
    first thing, I've written a little helper
    method over
  • 14:51 - 14:54
    here. Which I've just tacked on a string called
  • 14:54 - 14:57
    normalized_for_ngrams. And you know, what
    does this do? Well,
  • 14:57 - 15:00
    it downcases it, cause we're gonna do case
    insensitive.
  • 15:00 - 15:02
    There might be cases where you want to keep
  • 15:02 - 15:04
    case sensitivity. Whatever. Doesn't really
    matter. In this case
  • 15:04 - 15:06
    we're gonna go case insensitive.
  • 15:06 - 15:09
    Squish is a nice, convenient method that will
    kind
  • 15:09 - 15:11
    of standardize the white space for you. So
    like,
  • 15:11 - 15:14
    if there's any trailing or leading white space,
    and
  • 15:14 - 15:17
    if there's like a bunch of middle white space,
  • 15:17 - 15:19
    this will, it'll kill the beginning and ending
    and
  • 15:19 - 15:21
    it'll turn anything in the middle into a single
  • 15:21 - 15:21
    space.
  • 15:21 - 15:22
    So that way you just don't have to worry
  • 15:22 - 15:25
    about things like double spaces or, you know,
    other,
  • 15:25 - 15:27
    other weird things that can happen. Cause
    of course
  • 15:27 - 15:29
    it's the web. Whatever can go wrong will go
  • 15:29 - 15:32
    wrong. So make sure that your data's in
    some
  • 15:32 - 15:33
    kind of standardized format.
  • 15:33 - 15:36
    And the last thing I've done is removed punctuation.
  • 15:36 - 15:38
    And the reason for that is just cause like,
  • 15:38 - 15:40
    you know, there's commas, periods, colons,
    all sorts of
  • 15:40 - 15:43
    stuff like that. We don't really care about
    it.
  • 15:43 - 15:45
    And so let's just kill any character that's
    not
  • 15:45 - 15:47
    either a space or a word character. This is
  • 15:47 - 15:49
    kind of the, little like, Ruby special regex
    thing.
  • 15:49 - 15:53
    So we're gonna kill punctuation.
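A sketch of the normalization just described, using only core Ruby. The talk's helper is added to String and relies on `squish` from ActiveSupport; here a standalone method with the hypothetical name `normalize_for_ngrams` does the whitespace cleanup by hand, and it strips punctuation before collapsing whitespace so that removed characters cannot leave double spaces behind.

```ruby
# Hypothetical standalone version of the helper; not the talk's actual source.
def normalize_for_ngrams(text)
  text
    .downcase              # case-insensitive matching
    .gsub(/[^\w\s]/, "")   # drop anything that is not a word character or whitespace
    .gsub(/\s+/, " ")      # collapse runs of whitespace into single spaces
    .strip                 # trim leading and trailing whitespace
end

puts normalize_for_ngrams("  This talk,   is BORING!  ") # this talk is boring
```

One consequence of this regex is that apostrophes are removed too, so "don't" becomes "dont"; whether that matters depends on the terms being searched.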
  • 15:53 - 15:54
    And so we can actually just mess with this
  • 15:54 - 15:57
    in the console maybe. So let's take our little
  • 15:57 - 16:00
    example sentence. You know, this talk is boring.
    And
  • 16:00 - 16:04
    let's normalize that for ngrams. OK. All it
    did
  • 16:04 - 16:08
    was downcase it. And now we want to get
  • 16:08 - 16:09
    that into an array of words, which we can
  • 16:09 - 16:13
    just do with split. Cool.
  • 16:13 - 16:17
    And now there's actually this neat little
    Ruby enumerable
  • 16:17 - 16:18
    thing, which I didn't know about until pretty
    recently.
  • 16:18 - 16:22
    each_cons, which stands for each consecutive. And it
    And it
  • 16:22 - 16:25
    takes an argument, a single number, like two,
    and
  • 16:25 - 16:27
    what this says is give me all of the,
  • 16:27 - 16:30
    you know, consecutive pairs of two. So if
    we
  • 16:30 - 16:32
    to_a this, now we have this array of arrays,
  • 16:32 - 16:34
    which looks like exactly what we want.
  • 16:34 - 16:37
    This talk, talk is, and is boring. And so
  • 16:37 - 16:38
    the last thing we can do there is we
  • 16:38 - 16:44
    can just map that array to make these just
  • 16:44 - 16:44
    phrases.
  • 16:44 - 16:47
    So cool. So this is actually the entirety
    of
  • 16:47 - 16:50
    our ngrams method, is just, you know, this
    code
  • 16:50 - 16:52
    right here. So let's copy and paste this into
  • 16:52 - 16:56
    the old method here. So we want. We're doing
  • 16:56 - 17:03
    this on the abstract. Let's get some new lines
  • 17:03 - 17:04
    here.
  • 17:04 - 17:10
    All right, cool. So again, just to recap,
    you
  • 17:10 - 17:12
    take the abstract, we normalize it, which
    means, you
  • 17:12 - 17:15
    know, downcase and kill the punctuation. We
    split it
  • 17:15 - 17:17
    to words. Uh, wait. Actually this should not
    be
  • 17:17 - 17:21
    two. That should be n. And then we join
  • 17:21 - 17:24
    those. So let's, let's see if this worked.
  • 17:24 - 17:31
    So talk dot mine again. And one. OK. So
  • 17:31 - 17:33
    here are all the one grams, which is just
  • 17:33 - 17:36
    the sequence of words. And that looks correct.
    And
  • 17:36 - 17:42
    all of the two grams. Also looks correct,
    I
  • 17:42 - 17:45
    think. Yeah. To get, get a, yeah, OK, perfect.
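Put together, the steps recapped above (normalize, split, `each_cons`, join) fit in a few lines. This is a sketch written as a standalone function rather than the model method from the talk, with the normalization inlined using core Ruby only.

```ruby
# Standalone approximation of the ngrams method; the talk defines it on the Talk model.
def ngrams(text, n)
  normalized = text.downcase.gsub(/[^\w\s]/, "").gsub(/\s+/, " ").strip
  # each_cons(n) yields every run of n consecutive words as an array.
  normalized.split.each_cons(n).map { |words| words.join(" ") }
end

abstract = "This talk is boring."
puts ngrams(abstract, 1).inspect # ["this", "talk", "is", "boring"]
puts ngrams(abstract, 2).inspect # ["this talk", "talk is", "is boring"]
puts ngrams(abstract, 5).inspect # []
```

When n is larger than the number of words, `each_cons` yields nothing, so the result is an empty array rather than an error.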
  • 17:45 - 17:48
    And so this is kind of the, the method
  • 17:48 - 17:51
    we're gonna use to decompose these talks into
    just,
  • 17:51 - 17:54
    you know, an array of words and phrases. And
  • 17:54 - 17:56
    so what is the next step, now that we
  • 17:56 - 17:58
    have this method? Well, the next step is we
  • 17:58 - 17:59
    have to build these indexes that we're actually
    gonna
  • 17:59 - 18:04
    use to look up, you know, the final results.
  • 18:04 - 18:05
    And so for that, we're gonna use redis.
  • 18:05 - 18:07
    Now, we don't have sort of enough time to
  • 18:07 - 18:11
    really get totally into the details of redis.
    But,
  • 18:11 - 18:12
    you know, the, the thing that we're really
    gonna
  • 18:12 - 18:15
    use is the, the sorted set data structure,
    which
  • 18:15 - 18:16
    I'd definitely encourage you to check out.
    It's a
  • 18:16 - 18:19
    great data structure. Great feature of redis.
    And so
  • 18:19 - 18:20
    what is a sorted set?
  • 18:20 - 18:23
    Well, it's got the word set in it, so
  • 18:23 - 18:25
    that tells you something. It's, you know,
    unique elements.
  • 18:25 - 18:27
    And the, the neat feature of a sorted set
  • 18:27 - 18:29
    is that each element in the set also has
  • 18:29 - 18:32
    a score associated with it. So the way we
  • 18:32 - 18:35
    can use this is, remember, again, the question
    I'm
  • 18:35 - 18:37
    gonna answer is, like, you know, if someone
    searches
  • 18:37 - 18:39
    for Ember, you know, how many times was Ember
  • 18:39 - 18:40
    mentioned in 2007. How many times was it mentioned
  • 18:40 - 18:42
    in 2008. How many times was it mentioned in
  • 18:42 - 18:43
    2009?
  • 18:43 - 18:45
    So we're gonna have one sorted set for each
  • 18:45 - 18:48
    year, where the members of the sorted set
    are
  • 18:48 - 18:50
    all the words and phrases that appeared in
    RailsConf
  • 18:50 - 18:54
    talks, and the scores are the number of times
  • 18:54 - 18:56
    that those ngrams appeared.
  • 18:56 - 18:58
    And then, you know, redis is very efficient
    about
  • 18:58 - 19:00
    this zscore method. You can look up. It's
    like
  • 19:00 - 19:03
    this command right here would say, OK, in
    the
  • 19:03 - 19:06
    sorted set for 2014, get me the score associated
  • 19:06 - 19:09
    with the member ember. And that's gonna tell
    you,
  • 19:09 - 19:12
    you know, some number. Like, three or whatever.
    Is
  • 19:12 - 19:14
    the number of times it gets mentioned.
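The shape of the index can be pictured with nested hashes: one hash per year, mapping each n-gram to its count. This in-memory sketch only imitates the lookup; it is not Redis, the counts are invented, and `score_for` is a hypothetical name.

```ruby
# In-memory stand-in for one sorted set per year: member => score.
# The counts below are invented for illustration.
counts_by_year = {
  2013 => { "ember" => 1, "rails" => 40 },
  2014 => { "ember" => 3, "rails" => 52 }
}

# Analogue of a score lookup: the score for a member, or nil when the member is absent.
def score_for(counts_by_year, year, ngram)
  counts_by_year.fetch(year, {})[ngram]
end

# One data point per year is what the chart needs; missing terms count as zero.
series = counts_by_year.keys.sort.map do |year|
  [year, score_for(counts_by_year, year, "ember") || 0]
end

puts series.inspect # [[2013, 1], [2014, 3]]
```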
  • 19:14 - 19:16
    So what we have to do is build these
  • 19:16 - 19:19
    sorted sets. One for each year again. And
    again
  • 19:19 - 19:24
    I have an empty method called generate_ngram_data_by_year.
    So iterate
  • 19:24 - 19:26
    through all talks from a given year, you know,
  • 19:26 - 19:27
    calculate the ngram counts and add it to the
  • 19:27 - 19:30
    appropriate redis sorted set. So let's write
    that.
  • 19:30 - 19:32
    So one thing we need to do is make
  • 19:32 - 19:34
    sure we're not double counting. So if we have
  • 19:34 - 19:37
    an old sorted set sitting around, let's delete
    it.
  • 19:37 - 19:40
    So let's, redis.del year. We need to decide
    what
  • 19:40 - 19:43
    values of n we're gonna use. So let's just
  • 19:43 - 19:46
    say one, two, and three, meaning we're gonna
    calculate
  • 19:46 - 19:48
    all the one grams, two grams, three grams.
    Anything
  • 19:48 - 19:50
    longer than that and it's sort of, like, what's
  • 19:50 - 19:52
    even the point. You're getting into pretty
    specific sentences.
  • 19:52 - 19:53
    There's not gonna be a lot of repetition.
  • 19:53 - 19:56
    So now we need to iterate through each talk
  • 19:56 - 20:03
    for the given years. Where(:year => year).find_each.
    And then
  • 20:06 - 20:08
    for each talk we need to iterate through each
  • 20:08 - 20:14
    value of n. And then for each value of
  • 20:14 - 20:16
    n, what do we need to do? We need
  • 20:16 - 20:17
    to calculate the ngram, so do talk dot ngrams.
  • 20:17 - 20:19
    This is the method we just wrote. We're gonna
  • 20:19 - 20:20
    pass it n.
  • 20:20 - 20:23
    Do |ngram|.
  • 20:23 - 20:26
    And then finally, we're going to add this
    to
  • 20:26 - 20:29
    the relevant redis sorted set. So the command
    for
  • 20:29 - 20:30
    that is redis.zincrby.
  • 20:30 - 20:35
    And this goes, you give it a year, you
  • 20:35 - 20:39
    give it a number, like one, and you give
  • 20:39 - 20:40
    it what are you incrementing.
  • 20:40 - 20:43
    OK. So let's look at this method now. We're
  • 20:43 - 20:45
    gonna take, give it a year. We're gonna go
  • 20:45 - 20:48
    through every talk from that year. We're gonna
    go
  • 20:48 - 20:51
    through values of n, which is one, two and
  • 20:51 - 20:53
    three, so let's say one, OK. Get the talk.
  • 20:53 - 20:55
    Calculate all of its one grams. And then for
  • 20:55 - 20:59
    each one gram, add to the year sorted set
  • 20:59 - 21:03
    the value of one for that ngram. And then
  • 21:03 - 21:05
    do that just a bunch of times.
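The method walked through above can be sketched so that it runs without a Redis server. Everything here is an assumption layered on the talk's description: `FakeSortedSets` is a made-up in-memory object that mimics `del`, `zincrby(key, increment, member)` and `zscore(key, member)`, and the abstracts are passed in as plain strings instead of being loaded from the talks table.

```ruby
# Tiny in-memory fake for the sorted-set commands used here, so the sketch runs offline.
class FakeSortedSets
  def initialize
    @sets = Hash.new { |sets, key| sets[key] = Hash.new(0) }
  end

  def del(key)
    @sets.delete(key)
  end

  def zincrby(key, increment, member)
    @sets[key][member] += increment
  end

  def zscore(key, member)
    @sets[key].fetch(member, nil)
  end
end

def ngrams(text, n)
  text.downcase.gsub(/[^\w\s]/, "").split.each_cons(n).map { |words| words.join(" ") }
end

# abstracts: the abstract strings for one year (the talk reads these from the database).
def generate_ngram_data_by_year(store, year, abstracts)
  store.del(year) # start clean so reruns do not double count
  abstracts.each do |abstract|
    [1, 2, 3].each do |n|
      ngrams(abstract, n).each { |ngram| store.zincrby(year, 1, ngram) }
    end
  end
end

store = FakeSortedSets.new
abstracts_2014 = ["Rails is fun", "Rails is fast"] # invented sample data
2.times { generate_ngram_data_by_year(store, 2014, abstracts_2014) }

puts store.zscore(2014, "rails")    # 2
puts store.zscore(2014, "rails is") # 2
puts store.zscore(2014, "is fun")   # 1
```

Running the method twice and still seeing a count of 2 for "rails" shows why the delete at the top matters: without it, the second run would add to the first.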
  • 21:05 - 21:08
    So let's see if this works.
  • 21:08 - 21:14
    Let's reload. Again to prove I'm not lying.
    There's
  • 21:14 - 21:21
    nothing in redis at the moment. Oops. Gotta
    do
  • 21:21 - 21:22
    talk.
  • 21:22 - 21:29
    Let's worry about those Delayed::Jobs. Perfect.
    Drink break.
  • 21:30 - 21:33
    So it's going through each year now. And each
  • 21:33 - 21:35
    talk in each year, counting up all the words
  • 21:35 - 21:39
    and phrases and building our sorted sets.
    And it
  • 21:39 - 21:40
    is done.
  • 21:40 - 21:43
    So let's see what we got in here now.
  • 21:43 - 21:47
    OK, cool. So we got these keys. Let's, let's
  • 21:47 - 21:48
    look into one of these. One of the nice
  • 21:48 - 21:50
    things about the sorted set is you can, of
  • 21:50 - 21:53
    course, sort by it. And so the command here
  • 21:53 - 21:56
    is zrevrange. So we can do the 2014 sorted
  • 21:56 - 21:59
    set. So this is gonna give us the top
  • 21:59 - 22:01
    ten, or actually eleven, top eleven, you know,
    ngrams
  • 22:01 - 22:04
    in 2014. So let's see.
  • 22:04 - 22:09
    And we can actually add :with_scores => true.
    So
  • 22:09 - 22:12
    the most common words and phrases from 2014
    RailsConf
  • 22:12 - 22:17
    talk abstracts. Not very surprising. The,
    to, and, a,
  • 22:17 - 22:20
    of, in, you, how. Rails. OK. Rails makes the
  • 22:20 - 22:21
    number ten.
  • 22:21 - 22:24
    So there you go.
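To make the "top ten, or actually eleven" remark concrete, here is a minimal in-memory sketch of what a ZREVRANGE-style lookup with scores returns. It is an illustration only: `zrevrange_with_scores` is a hypothetical helper, not a call on a Redis client, and the counts are made-up numbers. The start and stop indexes are both inclusive, which is why a range of 0 to 10 yields eleven entries.

```ruby
# Made-up counts, shaped like one year's sorted set (member => score).
counts = { "the" => 120.0, "to" => 95.0, "rails" => 40.0, "ember" => 3.0 }

# Rough equivalent of ZREVRANGE key start stop WITHSCORES: highest score
# first, ties in reverse lexical order, then an inclusive start..stop slice.
def zrevrange_with_scores(counts, start, stop)
  sorted = counts.sort_by { |member, score| [score, member] }.reverse
  sorted[start..stop] || []
end

top = zrevrange_with_scores(counts, 0, 2)
top.each { |member, score| puts "#{member}: #{score.to_i}" }
# prints:
# the: 120
# to: 95
# rails: 40

puts (0..10).size # prints 11, since both ends of the range are included
```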
  • 22:24 - 22:25
    Now we can also, let's just have a little
  • 22:25 - 22:28
    fun here. See what some of the, sort of, top
  • 22:28 - 22:30
    non-trivial ones are. Obviously you could
    write some code,
  • 22:30 - 22:33
    maybe kill stop words. Stuff like that. If
    you
  • 22:33 - 22:35
    don't care about them.
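One way the stop-word idea could look, as a sketch under stated assumptions: the stop-word list is just the common words read out above, the ranked data is made up, and `drop_stop_words` is a hypothetical helper. It drops an ngram only when every word in it is a stop word, so a phrase like "ruby developers" survives while "how to" does not. That rule is one possible design choice, not something stated in the talk.

```ruby
require "set"

# Illustrative stop-word list (an assumption), taken from the words read out above.
stop_words = Set.new(%w[the to and a of in you how])

# Keep an ngram unless all of its words are stop words.
def drop_stop_words(ranked, stop_words)
  ranked.reject do |ngram, _score|
    ngram.split.all? { |word| stop_words.include?(word) }
  end
end

# Made-up [ngram, score] pairs, already sorted by score.
ranked = [
  ["the", 120.0], ["to", 95.0], ["rails", 40.0],
  ["how to", 12.0], ["ruby developers", 5.0]
]

filtered = drop_stop_words(ranked, stop_words)
filtered.each { |ngram, score| puts "#{ngram}: #{score.to_i}" }
# prints:
# rails: 40
# ruby developers: 5
```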
  • 22:35 - 22:40
    But, so. Rails. Can code. This talk. Most
    popular
  • 22:40 - 22:45
    two-word phrase. Pretty good. How to. Ruby
    developers. Eh,
  • 22:45 - 22:46
    this looks pretty, pretty relevant, right.
    I mean, these
  • 22:46 - 22:51
    are not words you'd be surprised to see in
  • 22:51 - 22:53
    a RailsConf talk abstract.
  • 22:53 - 22:56
    So those, you know, are the most common words.
  • 22:56 - 22:57
    So we now have this. We have this for
  • 22:57 - 22:59
    every year, by the way. So we can also
  • 22:59 - 23:01
    do something, this is the same thing for 2011.
  • 23:01 - 23:04
    Whatever. And the last piece of code we're
    going
  • 23:04 - 23:06
    to write, is we need to be able to
  • 23:06 - 23:07
    query this data.
  • 23:07 - 23:09
    So, you know, the actual, sort of, website
    or
  • 23:09 - 23:12
    finished product, you're gonna have to, you
    know, search
  • 23:12 - 23:13
    for a term. And you're gonna have to go
  • 23:13 - 23:16
    look up in your data, you know, what, what
  • 23:16 - 23:19
    are the relevant values for that term.
  • 23:19 - 23:21
    And so, how we're gonna do this. Well, the
  • 23:21 - 23:23
    first thing we gotta remember is that we normal-
  • 23:23 - 23:27
    remember we did this normalize for ngrams
    thing. So
  • 23:27 - 23:29
    we have to do that again, because what if
  • 23:29 - 23:31
    someone searches for a capitalized word or
    something
  • 23:31 - 23:33
    with punctuation. We have to process it the
    exact
  • 23:33 - 23:36
    same way that we processed our input. Otherwise
    it
  • 23:36 - 23:39
    won't match. So let's just do that.
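A small sketch of why the same normalization has to run on both sides. The talk's own normalizing method is not shown in this part of the transcript, so the `normalize_for_ngrams` below (lowercase, punctuation to spaces, whitespace collapsed) is an assumption about what such a method might do.

```ruby
# Assumed normalization: lowercase, turn punctuation into spaces,
# and collapse runs of whitespace into single spaces.
def normalize_for_ngrams(text)
  text.downcase.gsub(/[^a-z0-9\s]/, " ").split.join(" ")
end

stored = normalize_for_ngrams("Ruby developers")  # what went in at indexing time
searched = normalize_for_ngrams(" Ruby, Developers! ") # what a user might type

puts stored             # prints ruby developers
puts searched           # prints ruby developers
puts stored == searched # prints true
```

Without the second call, the raw search string would never equal the stored key, and the lookup would come back empty.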
  • 23:39 - 23:43
    And then we have this constant ALL_YEARS.
    And we're
  • 23:43 - 23:46
    gonna iterate through that with each_with_object,
    with a
  • 23:46 - 23:47
    hash. Let's just build up a hash. That's probably
  • 23:47 - 23:52
    the easy way to do it. Do |year, hash|.
  • 23:52 - 23:58
    And the, the relevant redis command, again,
    is zscore.
  • 23:58 - 24:04
    So we can do redis dot zscore(). We're gonna
  • 24:04 - 24:06
    look up in the hash for that year, the
  • 24:06 - 24:08
    term. And we need to put this actually in
  • 24:08 - 24:14
    the hash. And so, and then we need to
  • 24:14 - 24:16
    to_i that in case it's nil.
  • 24:16 - 24:19
    OK. So this now, what does this say? ALL_YEARS
  • 24:19 - 24:23
    is just, you know, 2007 through 2014. Go through
  • 24:23 - 24:26
    each of those years. And then build up our
  • 24:26 - 24:28
    hash so that the hash, the key of the
  • 24:28 - 24:30
    year, maps to the value of, you know, the
  • 24:30 - 24:34
    number of times that term appeared in that
    year.
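The query step described here can be sketched roughly as follows. This is not the talk's code: `FakeSortedSets` is a hypothetical in-memory stand-in for the Redis connection, its `zscore(key, member)` returns nil for a missing member as Redis does, the stored counts are made up, and the normalization is reduced to `downcase.strip` for brevity.

```ruby
# Hypothetical in-memory stand-in for Redis sorted sets (read side only).
class FakeSortedSets
  def initialize(sets)
    @sets = sets
  end

  # Score of `member` in the set at `key`, or nil when either is missing.
  def zscore(key, member)
    @sets.fetch(key, {}).fetch(member, nil)
  end
end

ALL_YEARS = (2007..2014).to_a

# Build { year => count } for one term. nil.to_i is 0, so years in which
# the term never appeared come back as zero instead of nil.
def query(redis, term)
  normalized = term.downcase.strip # simplified stand-in for the real normalizer
  ALL_YEARS.each_with_object({}) do |year, hash|
    hash[year] = redis.zscore(year.to_s, normalized).to_i
  end
end

# Made-up data: "ember" only shows up in 2014.
redis = FakeSortedSets.new("2014" => { "ember" => 4.0 })

result = query(redis, "Ember")
result.each { |year, count| puts "#{year}: #{count}" }
```

Here `each_with_object({})` passes the same hash into every iteration and returns it at the end, which avoids a separate accumulator variable.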
  • 24:34 - 24:38
    So let's, again, see if that works. Talk dot
  • 24:38 - 24:44
    query, you know, ruby or something. Cool.
    So in
  • 24:44 - 24:47
    2007 it was mentioned 52 times, 2014 22 times.
  • 24:47 - 24:50
    Whatever. We can, I guess, we said Ember originally.
  • 24:50 - 24:54
    And there you go. It was not mentioned until
  • 24:54 - 24:58
    this year. Which is also kind of telling.
  • 24:58 - 25:02
    And so this is basically, you know, all of
  • 25:02 - 25:04
    the kind of step two code you need. That's
  • 25:04 - 25:07
    sort of the ngram calculation, store the results.
    And
  • 25:07 - 25:10
    again, I reiterate, like, everything we just
    did, is
  • 25:10 - 25:13
    kind of trivially simple. There's no fancy
    algorithms. It's
  • 25:13 - 25:15
    just counting, you know, putting stuff in
    the right
  • 25:15 - 25:17
    data structure. Accessing it in sort of the
    right
  • 25:17 - 25:18
    way.
  • 25:18 - 25:21
    And I just think there's something like pretty,
    you
  • 25:21 - 25:23
    know, insightful about that, that you don't
    need to
  • 25:23 - 25:26
    do fancy things all the time. And that often
  • 25:26 - 25:29
    the kind of the coolest results will come
    from
  • 25:29 - 25:31
    something simple.
  • 25:31 - 25:32
    And so, as I said, the last thing we're
  • 25:32 - 25:33
    gonna do here is create this nice front end
  • 25:33 - 25:36
    interface that lets us investigate the results.
    You know,
  • 25:36 - 25:38
    unfortunately, we don't really have time to
    get into
  • 25:38 - 25:40
    that. It is all on the GitHub. But, I
  • 25:40 - 25:43
    will tell you, I use Highcharts as a
  • 25:43 - 25:46
    nice library, front-end library that makes
    it very simple
  • 25:46 - 25:47
    to get charts up and running. It's actually
    not
  • 25:47 - 25:48
    that much code.
  • 25:48 - 25:50
    And I've done this already. So let's start
    up
  • 25:50 - 25:54
    a server. And, oops. Let's fire up the localhost.
  • 25:54 - 25:59
    And so here we are. The abstractogram is our
  • 25:59 - 26:00
    app. So what are we, what are we gonna
  • 26:00 - 26:01
    search for here?
  • 26:01 - 26:04
    Let's see. I, you, we or something. And there
  • 26:04 - 26:05
    we go. So there, there it is. The number
  • 26:05 - 26:09
    of times the word you appears in each year.
  • 26:09 - 26:11
    Looks pretty flat. So, you know, the, these
    are
  • 26:11 - 26:13
    kind of constant. Anyone have any, anything
    else they
  • 26:13 - 26:16
    want to search for? Let's try ember, backbone.
  • 26:16 - 26:19
    All right. Let's say, we got, Postgres, I heard.
  • 26:19 - 26:24
    All right. I guess we could all say, let's
  • 26:24 - 26:29
    say SQL. No one cares about Postgres this year.
  • 26:29 - 26:33
    Service. SOA. Oh, there is sort of a rising
  • 26:33 - 26:36
    trend of service-oriented architecture.
  • 26:36 - 26:36
    Anything else?
  • 26:36 - 26:41
    TDD. That's a good one. TDD. Testing. Test-driven,
    how
  • 26:41 - 26:48
    about. So there we go. I'm sorry?
  • 26:49 - 26:54
    REST. That's a tricky one though, 'cause rest
    is
  • 26:54 - 26:55
    also like a real word that, you know, like,
  • 26:55 - 26:57
    the rest of the time will be something else.
  • 26:57 - 27:04
    And. Refactor. Let's see. Ooh. That's a good
    one.
  • 27:04 - 27:10
    DHH. Wow. Peaked 2011, peak DHH. Let's see,
    we
  • 27:10 - 27:12
    got, Heroku is a good one. On the rise.
  • 27:12 - 27:14
    I like we can just look at Ruby and
  • 27:14 - 27:15
    Rails. This is actually, I think, pretty relevant.
    It's
  • 27:15 - 27:19
    like, what are people talking about? Not Rails
    anymore.
  • 27:19 - 27:20
    We got to find something new to talk about.
  • 27:20 - 27:23
    You know, it's like, too many RailsConfs.
    And, in
  • 27:23 - 27:25
    fact, this actually came up at the, you know,
  • 27:25 - 27:27
    there was a speaker meeting, whatever, and
    everyone was
  • 27:27 - 27:29
    talking about how, you know, their talks weren't
    actually
  • 27:29 - 27:31
    about Rails.
  • 27:31 - 27:33
    And, you know, maybe this is actually an insightful
  • 27:33 - 27:36
    statement, that, you know, the, the community
    has obviously
  • 27:36 - 27:38
    gotten very large and there's just a ton of
  • 27:38 - 27:38
    other stuff to talk about. People have been
    talking
  • 27:38 - 27:41
    about Rails for a long time. And so, you
  • 27:41 - 27:43
    know, here I am giving a talk that's not
  • 27:43 - 27:46
    really directly about Rails. But, so maybe
    this is
  • 27:46 - 27:47
    like a real trend that people are just finding
  • 27:47 - 27:49
    other stuff to talk about.
  • 27:49 - 27:53
    And that is pretty cool. So I promised that
  • 27:53 - 27:56
    I would show you the repo or whatever on
  • 27:56 - 28:00
    GitHub. You can just do bit.ly slash railsconfdata.
    It's
  • 28:00 - 28:02
    just the code. Everything we've looked at
    today. Plus
  • 28:02 - 28:04
    some more stuff. It's actually running live
    on the
  • 28:04 - 28:07
    internet at abstractogram dot herokuapp dot
    com.
  • 28:07 - 28:10
    I figure the internet's probably not working,
    but let's
  • 28:10 - 28:17
    see. Yup. Classic. And, you know, otherwise
    that is
  • 28:17 - 28:20
    it. And thank you for listening. And I think
  • 28:20 - 28:20
    we have time for questions.
Title:
RailsConf 2014 - Demystifying Data Science: A Live Tutorial by Todd Schneider
Duration:
28:48
