Ruby Conf 2013 - Thinking about Machine Learning with Ruby by Bryan Liles

  • 0:16 - 0:20
    BRYAN LILES: So, hello. Welcome. I'm Professor
    Liles.
  • 0:20 - 0:22
    You can address me as Professor Liles.
  • 0:22 - 0:27
    And this is CD-612 - A Data Mining Exploration.
  • 0:27 - 0:33
    Get out here. There we go. And the objectives
  • 0:33 - 0:35
    here, and because I'm a professor at an accredited
  • 0:35 - 0:38
    university, I'm just gonna read the slides.
    We're gonna
  • 0:38 - 0:42
    explore the facets of machine learning. We're
    gonna have
  • 0:42 - 0:44
    a data scientist check list. We're gonna also
    talk
  • 0:44 - 0:48
    about the practical applications of converse
    inductive integrals in
  • 0:48 - 0:49
    the context of epsilon.
  • 0:49 - 0:51
    This is real exciting stuff here.
  • 0:51 - 0:55
    There's some prerequisites for this class.
    Basic understanding of
  • 0:55 - 0:57
    statistics. You have to know statistics to
    do machine
  • 0:57 - 1:01
    learning. And data mining. You need to know
    linear
  • 1:01 - 1:02
    algebra. You need to know a little bit of
  • 1:02 - 1:04
    calculus. And you also need to have the ability
  • 1:04 - 1:08
    to embiggen factorials in a cromulent fashion.
  • 1:08 - 1:12
    So let's start off with the review of stuff
  • 1:12 - 1:15
    you should know.
  • 1:15 - 1:17
    Anyone know what this is? And I feel bad,
  • 1:17 - 1:19
    because I gave this, I gave this talk a
  • 1:19 - 1:21
    little while ago, and let me - let me
  • 1:21 - 1:22
    actually preface this. This is actually supposed
    to be
  • 1:22 - 1:25
    Jeff Prudner's spot. I work with Gus in the
  • 1:25 - 1:28
    Thunderbolt Labs, and he got sick. With some
    third
  • 1:28 - 1:30
    world disease. And we thought it would be
    best
  • 1:30 - 1:33
    if he just not show up.
  • 1:33 - 1:37
    We like Guston. We wish Guston the best. And
  • 1:37 - 1:40
    I'm stepping in just to help him out. Go
  • 1:40 - 1:41
    through the labs.
  • 1:41 - 1:43
    So does anyone know what this is, right here,
  • 1:43 - 1:45
    besides math?
  • 1:45 - 1:47
    By the time I'm done with this talk, you
  • 1:47 - 1:51
    will know what this is. Yeah. Math sucks.
  • 1:51 - 1:56
    So let's talk about something else. Let's
    talk about
  • 1:56 - 1:59
    some background. This is an introduction to
    machine learning.
  • 1:59 - 2:02
    Machine learning is one of those really overloaded
    topics
  • 2:02 - 2:05
    that I prefer to not use that word. So
  • 2:05 - 2:07
    let's not talk about introduction to machine
    learning.
  • 2:07 - 2:10
    Let's talk about an introduction to data mining.
    Because,
  • 2:10 - 2:13
    you know what, statisticians have been data
    mining for
  • 2:13 - 2:18
    the past forty years. This talk is depth versus
  • 2:18 - 2:20
    breadth. I'm not gonna go in deep. I'm just
  • 2:20 - 2:23
    gonna slide across the top. I think it's better
  • 2:23 - 2:26
    for all of us if I do that.
  • 2:26 - 2:29
    Let's talk about machine learning or data
    mining. What
  • 2:29 - 2:31
    can we do with this kind of, with this
  • 2:31 - 2:34
    kind of technology? Everyone in here has problems
    that
  • 2:34 - 2:38
    can be solved with data mining. I was talking
  • 2:38 - 2:40
    to a gentleman earlier from DigitalOcean and
    he kept
  • 2:40 - 2:42
    me, and he was thinking, oh yeah, machine
    learning.
  • 2:42 - 2:46
    I want to be able to detect abuse.
  • 2:46 - 2:49
    You can actually have applications of classification
    to detect
  • 2:49 - 2:52
    abuse. What you can do is you can have
  • 2:52 - 2:54
    your logs come through, and you can actually
    start
  • 2:54 - 2:58
    classifying your logs as good traffic or bad
    traffic.
  • 2:58 - 3:01
    You see this stuff all the time. SpamAssassin
  • 3:01 - 3:04
    came out - I think in the 90s.
  • 3:04 - 3:07
    This is an application of machine learning. You
    have spam,
  • 3:07 - 3:09
    you have ham. So these are the kind of
  • 3:09 - 3:11
    problems you can have. And, and I've written
    things
  • 3:11 - 3:14
    in the security context where we were detecting
    anomalies
  • 3:14 - 3:17
    and I didn't know it was machine learning
    at
  • 3:17 - 3:19
    the time because, you know, I learned everything,
    I
  • 3:19 - 3:21
    got an internet degree - that's about what
    I
  • 3:21 - 3:24
    have. So I learned it all off of Wikipedia.
  • 3:24 - 3:26
    I didn't know it was machine learning, but
    come
  • 3:26 - 3:28
    to find out these are kind of things we're
  • 3:28 - 3:30
    gonna talk about in machine learning. One
    thing I
  • 3:30 - 3:31
    need to tell you about is I'm gonna, I
  • 3:31 - 3:34
    might use these words supervised versus unsupervised
    in machine
  • 3:34 - 3:36
    learning. This is very simple.
  • 3:36 - 3:38
    You'll see people talk about, this is an unsupervised
  • 3:38 - 3:42
    algorithm. This is a supervised algorithm.
    This is very
  • 3:42 - 3:45
    simple. In machine learning, you can train,
    you can
  • 3:45 - 3:48
    train these models, your algorithms, with data.
    That is
  • 3:48 - 3:51
    supervised. Or you can have a model that can
  • 3:51 - 3:55
    actually gain inference just by applying the
    data. That's
  • 3:55 - 3:56
    unsupervised.
  • 3:56 - 3:58
    Simple, simple.
  • 3:58 - 4:01
    I work at Thunderbolt Labs. They're pretty
    awesome. You
  • 4:01 - 4:05
    guys should at least go to our website. Actually,
  • 4:05 - 4:08
    yeah, go to our website, because yeah, our
    website's,
  • 4:08 - 4:11
    I'm pretty proud of it because-
  • 4:11 - 4:17
    Let's see here, where are we? All right. Ah,
  • 4:17 - 4:21
    someone put my face on it.
  • 4:21 - 4:27
    So yeah. Anything with my face on it has
  • 4:27 - 4:30
    got to be awesome. I'm bryanl on Twitter,
    and
  • 4:30 - 4:33
    the standard disclaimer is I do not represent
    Thunderbolt
  • 4:33 - 4:35
    Labs. Except for I do.
  • 4:35 - 4:40
    I do use bold, vulgar words. I'm never misogynistic
  • 4:40 - 4:42
    or anything like that, but I might offend
    you.
  • 4:42 - 4:47
    So follow with me with, just, be careful.
  • 4:47 - 4:52
    And Thunderbolt Labs here, @thunderboltlabs.
    Follow us. We Tweet
  • 4:52 - 4:55
    there sometimes. Here's a really cool thing.
    And this
  • 4:55 - 4:57
    is me getting on my soapbox and stepping on
  • 4:57 - 4:58
    machine learning for just a couple seconds.
  • 4:58 - 5:01
    You see at the bottom of RubyConf's website,
    we
  • 5:01 - 5:03
    are actually sponsoring as a gold sponsor
    at RubyConf
  • 5:03 - 5:05
    this year. And I'll tell you the reason why.
  • 5:05 - 5:08
    I've been - this is my eighth RubyConf. I
  • 5:08 - 5:10
    am not an old-timer, which is crazy. I've
    been
  • 5:10 - 5:11
    coming to RubyConf for eight years and I'm
    not
  • 5:11 - 5:14
    an old-timer. Never once have I, as a black
  • 5:14 - 5:19
    guy, felt intimidated by anyone at any talk
    at
  • 5:19 - 5:21
    any time. So any one saying that RubyConf
    isn't
  • 5:21 - 5:24
    diverse is out of their mind. So off my
  • 5:24 - 5:25
    soap box.
  • 5:25 - 5:30
    And let's move on.
  • 5:30 - 5:34
    So let's talk about required knowledge for
    machine learning.
  • 5:34 - 5:35
    There's a couple things you're gonna need
    to know.
  • 5:35 - 5:38
    You are gonna have to know math. I've presented
  • 5:38 - 5:40
    that equation earlier, which was actually
    an equation for
  • 5:40 - 5:44
    k-means clustering. Actually really simple
    when I explain it
  • 5:44 - 5:45
    to you.
  • 5:45 - 5:47
    You will have to know math. You will have
  • 5:47 - 5:48
    to know a little bit of Calculus. You will
  • 5:48 - 5:49
    have to know a little bit of algebra. You
  • 5:49 - 5:51
    might have to dive into statistics. You will
    need
  • 5:51 - 5:54
    to know these things. But guess what? There
    are
  • 5:54 - 5:56
    easy ways to learn this kind of stuff.
  • 5:56 - 5:59
    You will have to read papers. And I'm, I'm
  • 5:59 - 6:01
    not a fan of papers because I- I went
  • 6:01 - 6:04
    to ClojureConf last year, and every talk was
    like,
  • 6:04 - 6:09
    so I read this paper. Whoa! I mean this
  • 6:09 - 6:10
    is a good paper right here. This is a
  • 6:10 - 6:13
    paper on transactional memory. This is STM.
    This is
  • 6:13 - 6:16
    one of the tenets of Clojure. And it's actually
  • 6:16 - 6:18
    pretty famous - it's actually pretty crafty.
    Cause look
  • 6:18 - 6:20
    at it. Look at the authors. They're like,
    wow,
  • 6:20 - 6:22
    we're not even gonna put two emails on here.
  • 6:22 - 6:26
    We're gonna put them in little curly brackets.
  • 6:26 - 6:28
    You will have to read papers. But you know
  • 6:28 - 6:30
    what, there's nothing wrong. A little tech
    in your
  • 6:30 - 6:32
    life never hurt anyone.
  • 6:32 - 6:35
    You're gonna have to have persistence. This
    is the
  • 6:35 - 6:39
    interesting science. There are different facets
    of machine learning,
  • 6:39 - 6:40
    and depending on who you talk to - you
  • 6:40 - 6:43
    could talk to a statistician, versus more
    of an
  • 6:43 - 6:46
    applied math, math, an applied mathematician,
    you're gonna get
  • 6:46 - 6:48
    different answers about what machine learning
    is.
  • 6:48 - 6:50
    You've just got to be very persistent in the
  • 6:50 - 6:53
    whole entire topic. Because, guess what? This
    is hard.
  • 6:53 - 6:55
    So let's get started.
  • 6:55 - 6:58
    And today I'm going to introduce three topics.
    Regression,
  • 6:58 - 7:02
    classification, and clustering. These are
    three of the bigger
  • 7:02 - 7:04
    tenets of machine learning, and you can solve
    all
  • 7:04 - 7:06
    the problems. I'm here to solve all your problems
  • 7:06 - 7:07
    today.
  • 7:07 - 7:10
    So let's talk about regression.
  • 7:10 - 7:12
    Yes.
  • 7:12 - 7:15
    So does anyone know what regression is? Besides
    black
  • 7:15 - 7:18
    guy in the back? Anyone else? Anyone else?
    You
  • 7:18 - 7:20
    guys know what regression is?
  • 7:20 - 7:22
    Regression is a weird word to me. Cause, you
  • 7:22 - 7:24
    know what, I take, I take the base of
  • 7:24 - 7:28
    this word, regress, and, and regressions really
    are not
  • 7:28 - 7:31
    regressing. Regressions are really just trying
    to figure out,
  • 7:31 - 7:33
    you're trying to predict the value of something
    given
  • 7:33 - 7:34
    some data.
  • 7:34 - 7:37
    A good example of, of regression would be,
    you
  • 7:37 - 7:39
    have a list of data of housing prices, of
  • 7:39 - 7:41
    houses being sold, and you know that the house
  • 7:41 - 7:45
    for $100,000 had 1,000 square feet and two
    bedrooms
  • 7:45 - 7:48
    and the house had, that sold for $150,000
    had
  • 7:48 - 7:52
    1,500 feet and three bedrooms. And you have
    many,
  • 7:52 - 7:53
    many samples of this.
  • 7:53 - 7:54
    So what you can actually do is you can
  • 7:54 - 7:58
    take this data and you can actually somewhat
    accurately
  • 7:58 - 8:02
    predict price based on common criteria.
  • 8:02 - 8:04
    But the, we're gonna focus on linear regression.
    And
  • 8:04 - 8:06
    what linear regression is, is basically we're
    just going
  • 8:06 - 8:10
    to have it move laterally.
  • 8:10 - 8:12
    So let's talk about this. Does anyone know
    what
  • 8:12 - 8:13
    that at the top is? What that equation is
  • 8:13 - 8:15
    at the top?
  • 8:15 - 8:19
    It's the slope of a line, yes. And actually
  • 8:19 - 8:20
    here's a problem that I have, and I have,
  • 8:20 - 8:23
    I have a huge problem with mathematics in
    general.
  • 8:23 - 8:26
    When you are taught this in middle school,
    I
  • 8:26 - 8:29
    think you learn one slope formula in middle
    school,
  • 8:29 - 8:31
    you were not taught anything that looks like
    this.
  • 8:31 - 8:33
    You were taught something that looks more
    like y
  • 8:33 - 8:36
    equals mx plus b. And they have y equals
  • 8:36 - 8:38
    alpha plus beta x. Come on. This is the
  • 8:38 - 8:40
    problem I have with mathematics.
  • 8:40 - 8:42
    Depending on the branch of mathematics you're
    in, they
  • 8:42 - 8:45
    will actually use different symbols for the
    same thing.
  • 8:45 - 8:48
    Talk about persistence. Just gotta bear with
    these guys.
  • 8:48 - 8:50
    And for doing regressions, all you're going
    to do
  • 8:50 - 8:55
    is solve this equation. Come on - this is
  • 8:55 - 8:57
    simple. You're just, you're minimizing the
    function Q of
  • 8:57 - 9:00
    a and b, or alpha, beta, where the function
  • 9:00 - 9:05
    Q of alpha, beta equals the sum of epsilon with
  • 9:05 - 9:10
    a hat, squared - epsilon with a hat. It has a
  • 9:10 - 9:13
    hat. Why does it have a hat?
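For reference, the least-squares objective he's describing (a reconstruction; the slide itself isn't reproduced in the transcript) is

      \min_{\alpha,\beta} Q(\alpha, \beta), \qquad
      Q(\alpha, \beta) = \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2}
                       = \sum_{i=1}^{n} \left( y_i - \alpha - \beta x_i \right)^2

where the "e with a hat" is the estimated residual: the vertical gap between a data point and the fitted line.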
  • 9:13 - 9:15
    All right. This is funny.
  • 9:15 - 9:18
    Really all you're trying to do is, is this.
  • 9:18 - 9:20
    So talk about the line slope formula. You
    have
  • 9:20 - 9:22
    y equals mx plus b, and then I just
  • 9:22 - 9:25
    go through the permutations to get to y equals
  • 9:25 - 9:30
    beta x plus alpha.
  • 9:30 - 9:32
    Really all you're doing, is you're drawing
    a line.
  • 9:32 - 9:34
    And what you're going to do in this line,
  • 9:34 - 9:35
    and I'll show you in a second.
  • 9:35 - 9:37
    First you're gonna take this data, and actually
    this
  • 9:37 - 9:40
    data right here is - I, I, I just
  • 9:40 - 9:43
    went for linear, I looked, looked on the internet.
  • 9:43 - 9:48
    I used Google for linear regression data sets.
    And
  • 9:48 - 9:50
    I come across this cool one of the size
  • 9:50 - 9:54
    of brain, the body weight, versus the size
    of
  • 9:54 - 9:56
    a brain. And I plotted. Because that's what
    you
  • 9:56 - 9:57
    do. You plot stuff.
  • 9:57 - 9:59
    I'm just gonna plot here [00:09:58]. This
    is
  • 9:59 - 10:00
    not Ruby. Sorry there's no Ruby yet in this
  • 10:00 - 10:03
    talk. But I, I just plotted this data. And
  • 10:03 - 10:06
    this is a very interesting data set, because
    it
  • 10:06 - 10:07
    says that it was, it says that it was
  • 10:07 - 10:12
    a real data set, but I don't know. Because
  • 10:12 - 10:14
    if you look over here, I mean. I got
  • 10:14 - 10:15
    a big brain. I'm not gonna lie. I'm a
  • 10:15 - 10:18
    smart dude. But jees.
  • 10:18 - 10:21
    I want to meet this person right here.
  • 10:21 - 10:25
    In a linear regression, well all you're trying
    to
  • 10:25 - 10:27
    do is find, is trying to draw a line
  • 10:27 - 10:30
    that actually goes through the middle of all
    this
  • 10:30 - 10:32
    data. And it's like, well, we can infer that
  • 10:32 - 10:35
    pretty easily as humans because we're, we're
    linear regression
  • 10:35 - 10:36
    monsters.
  • 10:36 - 10:39
    But how do you do that mathemet- or algorithmically?
  • 10:39 - 10:40
    So what you're trying to do is you're trying
  • 10:40 - 10:42
    to calculate something called the error, and
    what this,
  • 10:42 - 10:45
    what that is is the error is the distance
  • 10:45 - 10:48
    between a point and a line. Remember before
    when
  • 10:48 - 10:50
    we were, that minimizing function?
  • 10:50 - 10:52
    You're just trying to find a line that minimizes
  • 10:52 - 10:55
    the distance between this and this for all
    the
  • 10:55 - 10:58
    points on here. This is not the right answer.
  • 10:58 - 11:01
    And just to simplify that a little bit more,
  • 11:01 - 11:03
    this is all you're doing. You're just basically
    trying
  • 11:03 - 11:05
    to find the line that is between the middle
  • 11:05 - 11:06
    of all those points. And we can do it
  • 11:06 - 11:07
    with math.
  • 11:07 - 11:10
    But better yet, we can do it with Ruby.
  • 11:10 - 11:12
    And I put code slides in so I could
  • 11:12 - 11:14
    actually remember to go to my code. So let
  • 11:14 - 11:16
    me hide this bad boy. And there's that pretty
  • 11:16 - 11:17
    guy again.
  • 11:17 - 11:19
    AUDIENCE: [whistles, cat-calls]
  • 11:19 - 11:24
    B.L.: You know what, I like this jacket. There
  • 11:24 - 11:28
    we go. We'll just mirror to this place of
  • 11:28 - 11:30
    getting, little difficult for me to look backwards.
  • 11:30 - 11:32
    So you know what happens is, I had never
  • 11:32 - 11:36
    given this talk with Mavericks. And Mavericks
    would get
  • 11:36 - 11:40
    you - it's nice but it'll get you.
  • 11:40 - 11:43
    So we're talking about regression. So what
    I've done
  • 11:43 - 11:45
    here is I've provided some examples, and we
    will
  • 11:45 - 11:48
    look- let's look at the Ruby code. No, actually,
  • 11:48 - 11:50
    let's look at the, the gnuplot code first,
  • 11:50 - 11:52
    [00:11:50] because this is gnuplot conf, right.
    Let's
  • 11:52 - 11:56
    look at some gnuplot.
  • 11:56 - 11:58
    So I'm using Vim this week. If you know
  • 11:58 - 12:03
    me I actually, I use a lot of editors.
  • 12:03 - 12:06
    And I know this is a Confreaks talk, and
  • 12:06 - 12:09
    there's sometimes some, some like clipping
    on the sides,
  • 12:09 - 12:13
    so what I'm going to do is move this
  • 12:13 - 12:16
    slightly over so we can see everything.
  • 12:16 - 12:19
    So. So let's look at this regression. So what
  • 12:19 - 12:22
    it is, is I've written some, some gnuplot
  • 12:22 - 12:24
    code, and really all I'm doing is making a
  • 12:24 - 12:28
    PNG called brain dot png, and I'm taking,
    I've
  • 12:28 - 12:30
    just setting the x labels and the y labels
  • 12:30 - 12:32
    and I'm setting in a grid and I'm plotting
  • 12:32 - 12:34
    the second and third columns of this brain
    dot
  • 12:34 - 12:35
    csv.
  • 12:35 - 12:37
    Simple, simple.
  • 12:37 - 12:41
    So when I run this gnuplot, and this
  • 12:41 - 12:43
    is how you run gnuplot stuff, you don't
  • 12:43 - 12:47
    even need to type that crap in. And I
  • 12:47 - 12:49
    did it, and it ran really fast cause it's
  • 12:49 - 12:51
    not Ruby - I'm just kidding.
  • 12:51 - 12:54
    And I plot, I type plotted that. So let's
  • 12:54 - 12:56
    go one step further. If I want to actually
  • 12:56 - 12:58
    - let's, let's see if we can do this
  • 12:58 - 13:02
    in Ruby. So, so if we look in this
  • 13:02 - 13:07
    regression dot rb - there we go. This is
  • 13:07 - 13:10
    actually the code for doing regressions in,
    in Ruby.
  • 13:10 - 13:13
    And what you'll notice here is that we have
  • 13:13 - 13:16
    this y-intercept, and we have this slope that
    mx
  • 13:16 - 13:19
    and the b part. And we're just calculating
    those.
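A minimal sketch of that calculation in plain Ruby - the method and variable names here are illustrative, not taken from his file:

      # Simple linear regression: fit y = slope * x + intercept by least squares.
      def linear_regression(xs, ys)
        n      = xs.size.to_f
        x_mean = xs.sum / n
        y_mean = ys.sum / n

        # slope = covariance(x, y) / variance(x)
        numerator   = xs.zip(ys).sum { |x, y| (x - x_mean) * (y - y_mean) }
        denominator = xs.sum { |x| (x - x_mean) ** 2 }

        slope     = numerator / denominator
        intercept = y_mean - slope * x_mean
        [slope, intercept]
      end

      # Made-up body weight vs. brain weight pairs, standing in for the CSV columns.
      xs = [1.0, 2.0, 3.0, 4.0]
      ys = [6.9, 13.4, 20.1, 26.5]
      slope, intercept = linear_regression(xs, ys)
      puts "slope=#{slope.round(2)} intercept=#{intercept.round(2)}"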
  • 13:19 - 13:21
    I actually was gonna write this code, but
    I
  • 13:21 - 13:24
    just searched for simple linear regression
    in Ruby and
  • 13:24 - 13:26
    there was a gist for it. So guess what?
  • 13:26 - 13:30
    Our wall - here's your credit.
  • 13:30 - 13:33
    Why write code that's already been written?
    So all
  • 13:33 - 13:35
    it does, it does the same thing, and at
  • 13:35 - 13:38
    the end, I like to call this train Ruby,
  • 13:38 - 13:40
    because you'll notice that I usually start
    in classes
  • 13:40 - 13:43
    and I end in just a mess. And the
  • 13:43 - 13:45
    reason why is because I wrote this on a
  • 13:45 - 13:47
    train and I was like, typing and then I
  • 13:47 - 13:50
    start looking out the window and got distracted.
    Yeah.
  • 13:50 - 13:52
    AUDIENCE: Ruby in Rails.
  • 13:52 - 13:54
    B.L.: Yeah, this is my, yeah. So this is
  • 13:54 - 13:57
    actually, this file is not too bad. There's
    only
  • 13:57 - 13:59
    like ten lines at the bottom that are kind
  • 13:59 - 14:05
    of crazy. So we'll run this. Regression one
    dot
  • 14:05 - 14:05
    rb.
  • 14:05 - 14:07
    And really what we're looking for are this,
    if
  • 14:07 - 14:09
    you want to plot the line you just need
  • 14:09 - 14:10
    the slope and the y-intercept. So we have
    this
  • 14:10 - 14:13
    number six point six four and six point six
  • 14:13 - 14:15
    seven. Can someone remember those numbers
    for me please?
  • 14:15 - 14:17
    Cause I'm gonna ask for them in a second.
  • 14:17 - 14:21
    And then what we'll do is, because, what we'll
  • 14:21 - 14:25
    do, the last thing we'll do, is we'll actually
  • 14:25 - 14:27
    look in the second regression text file that
    I
  • 14:27 - 14:29
    have here. This is more of the gnuplot
  • 14:29 - 14:33
    code. And what it's actually doing is, gnuplot
  • 14:33 - 14:36
    has linear regression built in. So we'll just
    use
  • 14:36 - 14:39
    theirs. And what their- and all I'm doing
    is
  • 14:39 - 14:42
    instead of writing tests, the message is just,
    I'm
  • 14:42 - 14:45
    using gnuplot to actually figure out, or
    tell
  • 14:45 - 14:47
    me if I have the right answer.
  • 14:47 - 14:48
    So if we look at the bottom of this
  • 14:48 - 14:50
    file, we have an m and a b, right
  • 14:50 - 14:52
    here. And gnew plot does a whole bunch of
  • 14:52 - 14:54
    the craziest things too, but, we actually
    told it
  • 14:54 - 14:55
    to fit the data.
  • 14:55 - 14:57
    And you'll notice the number here is six point
  • 14:57 - 15:00
    six two and six point eight eight. Close to
  • 15:00 - 15:02
    our numbers. I mean our numbers could have
    been
  • 15:02 - 15:03
    better. These numbers are actually a lot better
    because
  • 15:03 - 15:07
    gnuplot actually goes through and figures
    out if,
  • 15:07 - 15:09
    first if the data's linear, and also it gives
  • 15:09 - 15:11
    you error numbers. So this is actually the
    real
  • 15:11 - 15:13
    answer, but our Ruby code is pretty close.
  • 15:13 - 15:17
    And that brings me to a little point here.
  • 15:17 - 15:19
    You will not do a lot of machine learning
  • 15:19 - 15:22
    in Ruby. But because Ruby is so approachable,
  • 15:22 - 15:24
    and so easy to use, Ruby is a great
  • 15:24 - 15:27
    learning language to learn how to do machine
    learning
  • 15:27 - 15:28
    before we get to go plot some more later.
  • 15:28 - 15:30
    So just to show you one more thing, we'll
  • 15:30 - 15:36
    plot the output of our, we'll plot the outputs
  • 15:36 - 15:40
    of our files here, and. OK.
  • 15:40 - 15:46
    And we'll open brain regression and there's
    our, that's
  • 15:46 - 15:48
    our file. This is from gnuplot. And we'll
  • 15:48 - 15:55
    open brain_regression_ruby. And there you
    go. They look similar.
  • 15:55 - 15:57
    Actually we'll put them right next to each
    other
  • 15:57 - 16:01
    so you can inspect for yourself. Simple linear
    regression
  • 16:01 - 16:03
    in Ruby. This code will be on GitHub, or
  • 16:03 - 16:04
    actually is on GitHub. You probably can find
    it
  • 16:04 - 16:07
    now if you're, you know, I'm BryanL. But you'll
  • 16:07 - 16:10
    notice that these are the same.
  • 16:10 - 16:12
    And one's with Ruby and one's with, and one's
  • 16:12 - 16:15
    with gnuplot. And this is how simply, and
  • 16:15 - 16:17
    you'll notice if we go back to our file
  • 16:17 - 16:23
    here, linear regression one dot ruby.
  • 16:23 - 16:27
    It's not many lines. Actually it's, it's like
    forty-seven
  • 16:27 - 16:31
    lines. Let's say it's forty-two lines. It's
    forty-two lines.
  • 16:31 - 16:35
    So that's regression. And when I, what I've
    basically
  • 16:35 - 16:38
    shown you is that it's easy in Ruby to
  • 16:38 - 16:41
    plot lines and, because you can plot lines
    you
  • 16:41 - 16:43
    can do simple linear regression.
  • 16:43 - 16:45
    Let's talk about classification next. And
    before we talk
  • 16:45 - 16:47
    about classification, I want to show you something
    here.
  • 16:47 - 16:51
    I'll go to my web-browser - that guy, keeps
  • 16:51 - 16:52
    on coming back.
  • 16:52 - 16:55
    I want to show you something. There we go.
  • 16:55 - 16:57
    Let's make this a little bit smaller. My computer's
  • 16:57 - 17:01
    pretty smart. I, I wrote this Sinatra app
    called
  • 17:01 - 17:04
    the number game, and I'll, and I'll reload
    it
  • 17:04 - 17:06
    a couple times so you can see what's going
  • 17:06 - 17:13
    on. So here's once. Here's again. And so what's
  • 17:14 - 17:16
    happening here is there's a dataset called
    the MNIST
  • 17:16 - 17:19
    dataset, and all it is is a collection of
  • 17:19 - 17:22
    60,000 hand-written numbers. And what I did
    was I
  • 17:22 - 17:24
    wrote a simple app to actually go through
    and
  • 17:24 - 17:25
    recognize numbers.
  • 17:25 - 17:28
    Simple application on machine learning. This
    is what you
  • 17:28 - 17:33
    call classification. We are classifying these
    bit, these pixels
  • 17:33 - 17:36
    as a number. And, you notice that for most
  • 17:36 - 17:42
    part, I believe that this classifier is sixty-five
    percent,
  • 17:42 - 17:46
    maybe seventy-five percent correct. I don't
    really remember.
  • 17:46 - 17:48
    So how do we do stuff like this? Well
  • 17:48 - 17:51
    first of all, with classification, let's just
    go back
  • 17:51 - 17:53
    to my slides.
  • 17:53 - 17:57
    How do we classify? Well, a classification,
    classification's actually
  • 17:57 - 17:59
    really simple. We're basically doing the same
    thing we're
  • 17:59 - 18:03
    doing in regression, but we have way more
    dimensions.
  • 18:03 - 18:06
    So in that, in that other file here we
  • 18:06 - 18:09
    have an image, and this image I do know
  • 18:09 - 18:10
    is 28 by 28.
  • 18:10 - 18:13
    And I gave this talk before, I asked somebody,
  • 18:13 - 18:18
    does anyone know what 28 by 28 is? 28
  • 18:18 - 18:20
    times 28?
  • 18:20 - 18:23
    When I was in Boston, someone yelled the answer
  • 18:23 - 18:25
    out before I finished typing it. He must have
  • 18:25 - 18:26
    a special mind.
  • 18:26 - 18:30
    But it's, so, this, this particular file has
    784
  • 18:30 - 18:36
    pixels. So, that we have 784 pix- 784 features
  • 18:36 - 18:38
    that we can classify this document against.
    And really
  • 18:38 - 18:41
    what we're doing is we're, in memory, just
    drawing
  • 18:41 - 18:43
    a line and trying to find out, we're just
  • 18:43 - 18:46
    predicting, basically, what our image is.
  • 18:46 - 18:49
    So without further ado, this is RubyConf.
    Let's see
  • 18:49 - 18:51
    some code.
  • 18:51 - 18:55
    Let's see, classification.
  • 18:55 - 18:58
    So what I have here, I'm actually gonna run
  • 18:58 - 18:59
    - I wrote this classifier. I'll run it real
  • 18:59 - 19:01
    quick and then I'll show you what's in there.
  • 19:01 - 19:03
    SO what it's doing is I have sixty thousand
  • 19:03 - 19:05
    images that are twenty eight by twenty eight,
    and
  • 19:05 - 19:07
    I have sixty thousand labels, and that's the
    reason
  • 19:07 - 19:09
    I know is I'm right or- that's what I
  • 19:09 - 19:10
    know what it is.
  • 19:10 - 19:12
    Because it, the data is labeled.
  • 19:12 - 19:13
    What I'm doing right now is I'm running a
  • 19:13 - 19:15
    trainer on it. And in this case, I'm using
  • 19:15 - 19:18
    something called support, support vector machines.
    I find this
  • 19:18 - 19:20
    easier to use it and then tell you what
  • 19:20 - 19:22
    it is. So really what I'm doing here, is
  • 19:22 - 19:26
    I'm classifying this data using a support
    vector machine.
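The exact libsvm calls aren't shown on screen in the transcript, so here is a rough sketch of that train-then-predict flow, assuming the rb-libsvm gem's documented interface, with tiny made-up feature vectors standing in for the 784-pixel images:

      require 'libsvm'   # rb-libsvm gem

      # Each training example is a feature vector plus a label (the digit it shows).
      examples = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.7]].map { |f| Libsvm::Node.features(f) }
      labels   = [4, 5]

      problem = Libsvm::Problem.new
      problem.set_examples(labels, examples)

      parameter            = Libsvm::SvmParameter.new
      parameter.cache_size = 100     # MB
      parameter.eps        = 0.001
      parameter.c          = 10

      # Train on the labeled set (supervised learning) ...
      model = Libsvm::Model.train(problem, parameter)

      # ... then predict a label for an unseen feature vector.
      prediction = model.predict(Libsvm::Node.features([0.2, 0.8, 0.3]))
      puts "predicted digit: #{prediction.to_i}"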
  • 19:26 - 19:28
    And then I'm predicting the data. So I have
  • 19:28 - 19:31
    a sixty thousand dollar- or, sixty thousand,
    a sixty
  • 19:31 - 19:34
    thousand count training set convert - this
    is supervised
  • 19:34 - 19:41
    learning. And I have a ten thousand count
    test
  • 19:41 - 19:43
    set. And what it's doing now is it's actually
  • 19:43 - 19:47
    going through and seeing how accurate the
    classifier was.
  • 19:47 - 19:49
    And what we find here is that I'm only
  • 19:49 - 19:51
    about sixty-five percent accurate.
  • 19:51 - 19:53
    And, and the reason why is because you would
  • 19:53 - 19:55
    actually, I would actually have to go through,
    if
  • 19:55 - 19:57
    I can actually train all sixty thousand of
    those
  • 19:57 - 20:00
    things, and Ruby, because of my global interpreter
    lock,
  • 20:00 - 20:02
    I only have one core of usage. So I
  • 20:02 - 20:04
    actually tried to do it - I stopped three
  • 20:04 - 20:06
    days cause it was just so slow, cause it
  • 20:06 - 20:08
    was just using one core.
  • 20:08 - 20:13
    Really, but what it's doing is actually, is
    going
  • 20:13 - 20:16
    through and it's saying, this is the name.
    The
  • 20:16 - 20:18
    computer says, OK. I know that. This is a
  • 20:18 - 20:20
    four. OH, that's a nice four. I know that.
  • 20:20 - 20:22
    This is the four. And then they're all looking
  • 20:22 - 20:24
    different. And when I go back through with
    the
  • 20:24 - 20:26
    test set, all I'm doing is going, then, is
  • 20:26 - 20:28
    saying well I think this is, and I think
  • 20:28 - 20:29
    this is.
  • 20:29 - 20:33
    So let's look at some more train Ruby. So
  • 20:33 - 20:35
    this is, and I'll show you a good example
  • 20:35 - 20:37
    of this. I was really into this code when
  • 20:37 - 20:39
    I was starting and what I - I'll just
  • 20:39 - 20:40
    go through and I'll talk about it. I have
  • 20:40 - 20:42
    a data set, and then I have this loader
  • 20:42 - 20:44
    which actually just loads the data from the
    files.
  • 20:44 - 20:47
    The files, it's a binary format and it's called
  • 20:47 - 20:49
    gsub, so ignore all this.
  • 20:49 - 20:51
    And then what I do is I load the
  • 20:51 - 20:52
    labels, and the labels are along with the
    file
  • 20:52 - 20:55
    and it basically says this blob of data is
  • 20:55 - 20:57
    a four, this blob of data is a five.
  • 20:57 - 20:59
    And then I start looking out the window on
  • 20:59 - 21:02
    the train.
  • 21:02 - 21:04
    So really what I've done here is, so the
  • 21:04 - 21:06
    line at the top is I'm, I'm basically setting
  • 21:06 - 21:09
    a timer and I'm loading the data and then
  • 21:09 - 21:12
    I'm, I'm actually going through and I'm using
    something
  • 21:12 - 21:14
    called libsvm, which I'll talk about in a
    second.
  • 21:14 - 21:16
    And I'm classifying all the data. This is
    a
  • 21:16 - 21:20
    really, really important lesson. If anyone
    in here is
  • 21:20 - 21:25
    writing machine learning algorithms, you don't
    belong here. You
  • 21:25 - 21:28
    belong doing something very important. We
    have to, you,
  • 21:28 - 21:30
    in machine learning you will stand on the
    back
  • 21:30 - 21:33
    of giants. You will not write machine learning
    code.
  • 21:33 - 21:35
    You will use things like shark and spark.
  • 21:35 - 21:37
    Or you will use mahout, or you will use
  • 21:37 - 21:40
    libsvm, or you will use AI4R, but you will
  • 21:40 - 21:42
    not write this code. And I'm, what I'm doing
  • 21:42 - 21:44
    today is I'm trying to introduce that to you
  • 21:44 - 21:46
    so that you can see how easy it is
  • 21:46 - 21:50
    to write something like this.
  • 21:50 - 21:52
    This took me about twenty minutes to write.
    Really
  • 21:52 - 21:55
    it's just a Sinatra app. Not even, it's not
  • 21:55 - 21:57
    even a Rails app. It's a Sinatra app. And
  • 21:57 - 21:59
    it just throws an image up and, and that's
  • 21:59 - 22:01
    it.
  • 22:01 - 22:05
    I'm pretty proud of this. I'm sorry.
  • 22:05 - 22:10
    So moving on. So we talked about, we, we're
  • 22:10 - 22:12
    talked about - what we did here is linear
  • 22:12 - 22:16
    classification. And we use a support vector
    machine, and
  • 22:16 - 22:17
    there was code. I jumped ahead.
  • 22:17 - 22:20
    And there's also something called classification
    with the decision
  • 22:20 - 22:22
    tree, and I was gonna talk about this, but
  • 22:22 - 22:24
    I was looking through RubyConf talks last
    year -
  • 22:24 - 22:27
    I actually don't go to the talks at RubyConf
  • 22:27 - 22:28
    for some reason. But I saw, I actually watched
  • 22:28 - 22:31
    this whole talk. This guy, Chris Nelson did
    a
  • 22:31 - 22:33
    talk in Denver at this same conference last
    year,
  • 22:33 - 22:35
    and this is actually a pretty good talk on
  • 22:35 - 22:38
    decision trees. He goes to forty minutes on
    one
  • 22:38 - 22:39
    topic, and it's pretty good.
  • 22:39 - 22:42
    So basically decision trees are this. You
    basically learn
  • 22:42 - 22:46
    how to divide your data into else clauses.
    And
  • 22:46 - 22:47
    you do it as a tree, and you just
  • 22:47 - 22:49
    go down until you get more and more specific.
  • 22:49 - 22:51
    You'll find that there's actually software
    out there that
  • 22:51 - 22:56
    builds huge nested trees and that's how they're
    doing
  • 22:56 - 22:57
    their classifications.
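As a toy illustration of those nested else clauses - hand-written here, not produced by any library, with made-up prices - a decision tree over the earlier housing example might look like:

      # A hand-rolled decision "tree": each branch splits on one feature,
      # getting more specific as you go down.
      def predict_price(square_feet, bedrooms)
        if square_feet < 1200
          bedrooms < 3 ? 100_000 : 120_000
        else
          bedrooms < 3 ? 140_000 : 150_000
        end
      end

      puts predict_price(1000, 2)   # => 100000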
  • 22:57 - 23:01
    So here, let's talk about clustering. What
    can we
  • 23:01 - 23:03
    use clustering for? Well we can use it to
  • 23:03 - 23:06
    group documents. Group like documents. We
    can use it
  • 23:06 - 23:09
    to detect plagiarism.
  • 23:09 - 23:11
    This is actually a new part of the talk.
  • 23:11 - 23:16
    And I wanted to do some live coding. So
  • 23:16 - 23:20
    what I'm going to do here is go backwards
  • 23:20 - 23:25
    in my Pry session and there's something called
    Jaccard's
  • 23:25 - 23:27
    constant, or coefficient.
  • 23:27 - 23:29
    And what it does is it allows you to
  • 23:29 - 23:32
    take a set of data and classify it again
  • 23:32 - 23:34
    - or, or compare it against another set of
  • 23:34 - 23:37
    data. And basically what we can do with this
  • 23:37 - 23:39
    is we can actually do- build a tool that
  • 23:39 - 23:41
    can detect plagiarism. And let me show you
    how
  • 23:41 - 23:42
    easy this is.
  • 23:42 - 23:44
    Start require Jaccard, and what I went and
    did
  • 23:44 - 23:48
    on the internet is I went and search for
  • 23:48 - 23:51
    some text. So what I'm going to do here
  • 23:51 - 23:53
    is I'm gonna cut and paste this text on
  • 23:53 - 23:56
    this web page, plagiarize it, and I'm going
    to
  • 23:56 - 23:58
    put it in a variable. And then what I'm
  • 23:58 - 24:00
    going to do is make a variable called a
  • 24:00 - 24:01
    chunk.
  • 24:01 - 24:05
    This is another little quibble I have about
    Ruby.
  • 24:05 - 24:06
    So you have a two, you do a whole
  • 24:06 - 24:09
    bunch of underscore case, you can't tell me
    that
  • 24:09 - 24:12
    doesn't look better than this - I'm sorry.
    I
  • 24:12 - 24:14
    like this better. Maybe I, I must have done
  • 24:14 - 24:16
    Java and Scala too much.
  • 24:16 - 24:17
    But, so what we're going to do is we're
  • 24:17 - 24:20
    gonna chunk up that a, so we can do
  • 24:20 - 24:23
    a dot split and we can do like this.
  • 24:23 - 24:29
    This is what we'll call poor man's NLP. And
  • 24:29 - 24:33
    now we have an array with, with our data
  • 24:33 - 24:35
    in it. And what we'll do now is we'll
  • 24:35 - 24:39
    cop- we'll go down here in another example
    and
  • 24:39 - 24:42
    we'll copy, this is supposedly plagiarized
    text. So we'll
  • 24:42 - 24:45
    just go b equals this - I'll consider it
  • 24:45 - 24:48
    a function to do this. I did not.
  • 24:48 - 24:51
    And this is live. We're on tv here. So.
  • 24:51 - 24:54
    We'll do it all at one time. So now
  • 24:54 - 24:57
    we have this. And now for c, I'm just
  • 24:57 - 25:04
    gonna say-
    that's my new paper. So we have
  • 25:05 - 25:08
    a, b, c, and we gotta chunk up c
  • 25:08 - 25:15
    too into- so, so now what we're going to
  • 25:17 - 25:19
    do is, I have to go back here.
  • 25:19 - 25:26
    There we go. Is we're going to find which
  • 25:26 - 25:33
    two papers match each other. Oh. This is live
  • 25:34 - 25:38
    coding, and I'm not cheating this time.
  • 25:38 - 25:41
    Thank you.
  • 25:41 - 25:48
    Now see this is a group effort.
  • 25:48 - 25:51
    There we go. All right. So what it did
  • 25:51 - 25:54
    is it, it actually returned the whole entire
    array.
  • 25:54 - 25:56
    But if I go through this code and, cause
  • 25:56 - 25:58
    I'm in Pry, you'll notice that it returns
    two
  • 25:58 - 26:01
    arrays, and the two arrays are the, is the,
  • 26:01 - 26:04
    is the original copy and the plagiarized copy.
    Notice
  • 26:04 - 26:06
    that my copy, it's saying that these two documents
  • 26:06 - 26:10
    are similar. And what we can also do here,
  • 26:10 - 26:11
    and I'll move this to the top in a
  • 26:11 - 26:11
    second-
  • 26:11 - 26:14
    Yes, sir.
  • 26:14 - 26:17
    AUDIENCE: indecipherable - 00:26:17
  • 26:17 - 26:22
    B.L.: Oh, sorry about that. So we'll, we'll,
    we'll,
  • 26:22 - 26:26
    we'll do the com- coefficients, and this is
    Jaccard's
  • 26:26 - 26:31
    coefficient, we'll generate coefficients for,
    so notice these numbers
  • 26:31 - 26:35
    are point o one five and, and if we
  • 26:35 - 26:39
    go and see, you notice that this number is
  • 26:39 - 26:41
    a lot lower - point oh one. What we
  • 26:41 - 26:43
    can do with Jaccard's coefficient is we can
    actually
  • 26:43 - 26:46
    compare documents based on how similar these
    numbers are.
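The number being generated is just set overlap: the size of the intersection of the two chunked-up word lists divided by the size of their union. A pure-Ruby sketch of the same arithmetic (this is not the gem he's calling, and the sentences are made up):

      require 'set'

      # Jaccard coefficient: |A ∩ B| / |A ∪ B| over the chunked-up words.
      def jaccard_coefficient(a_words, b_words)
        a, b = Set.new(a_words), Set.new(b_words)
        (a & b).size.to_f / (a | b).size
      end

      a = "the quick brown fox jumps over the lazy dog".split
      b = "the quick brown fox leaps over a lazy dog".split
      c = "completely unrelated text about machine learning".split

      puts jaccard_coefficient(a, b)   # high - looks plagiarized
      puts jaccard_coefficient(a, c)   # low  - probably original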
  • 26:46 - 26:51
    Once again, simple plagiarism detection. We
    call it classification
  • 26:51 - 26:52
    in machine learning. Yes sir.
  • 26:52 - 26:54
    AUDIENCE: indecipherable - 00:26:53
  • 26:54 - 26:56
    B.L.: Man, I don't know.
  • 26:56 - 27:00
    I actually, I do know, but I'm not gonna
  • 27:00 - 27:04
    tell you. Go look on Wikipedia. But, no, we
  • 27:04 - 27:06
    can talk about it in a second. I mean
  • 27:06 - 27:09
    after this talk.
  • 27:09 - 27:11
    So the next thing is k-means clustering, and
    this
  • 27:11 - 27:13
    is the fancy topic that, that formula that
    I
  • 27:13 - 27:16
    showed you earlier was k-means clustering.
    And what you
  • 27:16 - 27:18
    can do with k-means clustering is something
    neat. So
  • 27:18 - 27:21
    imagine that you have a scatter plot of data.
  • 27:21 - 27:23
    And I have a scatter plot of data.
  • 27:23 - 27:26
    Uh-oh. Uh-oh.
  • 27:26 - 27:33
    I'm having technical difficulties. All right,
    I mean like
  • 27:34 - 27:40
    seriously. Keynote thirteen crash going-
  • 27:40 - 27:45
    All right, so we'll reload keynote.
  • 27:45 - 27:49
    Mhmm. So imagine we have some dots. And pretend
  • 27:49 - 27:54
    these dots right here are star field maps.
    So pictures,
  • 27:54 - 27:57
    you know, two-dimensional pictures of distant
    galaxies. And we
  • 27:57 - 27:59
    want to take those galaxies, and we want to
  • 27:59 - 28:02
    cluster them into groups, because we're gonna
    say that
  • 28:02 - 28:04
    the stars that are in each group are a
  • 28:04 - 28:07
    galaxy. So this is not Ruby, this is actually
  • 28:07 - 28:08
    Javascript.
  • 28:08 - 28:10
    but so let me show you this. I'm gonna
  • 28:10 - 28:12
    show you what the computer can do. So I'm
  • 28:12 - 28:14
    gonna reload this page, and you're gonna notice
    that
  • 28:14 - 28:17
    the, the, the colors are a little bit off,
  • 28:17 - 28:24
    but just wait a second here.
  • 28:24 - 28:29
    There you go. Did you see that? So what
  • 28:29 - 28:31
    I've done here is, and you're gonna like,
    Bryan
  • 28:31 - 28:32
    what'd you do? So what I've done here is,
  • 28:32 - 28:34
    you see these bigger dots? These bigger dots
    are
  • 28:34 - 28:37
    called centroids in k-means. And
    what I've
  • 28:37 - 28:39
    done, doing, is I've put them on the screen
  • 28:39 - 28:42
    randomly. And what I've done is I colored
    the,
  • 28:42 - 28:46
    the little dots based on the closest big dot,
  • 28:46 - 28:47
    and then what I do is I move the
  • 28:47 - 28:51
    big dot to the center of all the, all
  • 28:51 - 28:54
    the similar colored dots.
  • 28:54 - 28:55
    And then I do it again. And then I
  • 28:55 - 28:57
    do it again. Then I do it again. Until
  • 28:57 - 29:00
    it stops moving. And then when it stops moving
  • 29:00 - 29:02
    I've detected things that are similar. So
    you'll notice
  • 29:02 - 29:06
    that down here in the bottom corner here that
  • 29:06 - 29:08
    we can detect with our eyes that this is
  • 29:08 - 29:10
    a different group, and we can detect with
    our
  • 29:10 - 29:12
    eyes that this is a different group. With
    k-means
  • 29:12 - 29:14
    clustering, you actually - usually what they'll
    do is
  • 29:14 - 29:17
    they'll run it hundreds, if not thousands
    of times,
  • 29:17 - 29:21
    and they'll actually generate a good sample
    here.
  • 29:21 - 29:23
    Yup.
  • 29:23 - 29:25
    We'll generate a good sample here and what
    we'll
  • 29:25 - 29:27
    be able to do is use averages over time
  • 29:27 - 29:30
    to actually determine what the probability
    of a group,
  • 29:30 - 29:33
    of a, a little dot being in a group
  • 29:33 - 29:34
    with a big dot.
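That loop - color each little dot by its nearest big dot, move each big dot to the mean of its dots, repeat until nothing moves - is the whole algorithm. His demo is D3/JavaScript; here is a small pure-Ruby sketch of the same idea, with made-up points:

      # k-means: points and centroids are [x, y] pairs.
      def distance(p, q)
        Math.sqrt((p[0] - q[0])**2 + (p[1] - q[1])**2)
      end

      def kmeans(points, k, iterations = 100)
        centroids = points.sample(k)             # drop the "big dots" in at random
        iterations.times do
          # color each little dot by its closest big dot
          clusters = points.group_by { |pt| centroids.min_by { |c| distance(pt, c) } }
          # move each big dot to the center of its similarly colored dots
          new_centroids = clusters.values.map do |pts|
            [pts.sum { |pt| pt[0] } / pts.size.to_f,
             pts.sum { |pt| pt[1] } / pts.size.to_f]
          end
          break if new_centroids.sort == centroids.sort   # stopped moving
          centroids = new_centroids
        end
        centroids
      end

      points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
      p kmeans(points, 2)   # roughly [[1.33, 1.33], [8.33, 8.33]]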
  • 29:34 - 29:39
    So, let's see something here. So let's go
    up.
  • 29:39 - 29:45
    And we're talking about clustering. And let's
    just get
  • 29:45 - 29:46
    k-means.
  • 29:46 - 29:50
    This app is actually not, this is actually
    a
  • 29:50 - 29:54
    a JS, a D3 app.
  • 29:54 - 29:55
    I'm not gonna talk about the code. I just
  • 29:55 - 29:57
    want to show you how much code it is,
  • 29:57 - 30:01
    and this is why we use Ruby, cause, come
  • 30:01 - 30:02
    on. Yup.
  • 30:02 - 30:04
    Come on. Yup.
  • 30:04 - 30:09
    Come on. Yup. Right. We're almost there - wait,
  • 30:09 - 30:11
    wait, wait, wait, wait.
  • 30:11 - 30:12
    Yup. OK we're there.
  • 30:12 - 30:15
    So the reason I show that is because this
  • 30:15 - 30:17
    brings me into a new topic. And like I
  • 30:17 - 30:22
    said, I'm not - oh, my gosh. It did
  • 30:22 - 30:23
    crash.
  • 30:23 - 30:24
    I'm not here to show you how to do
  • 30:24 - 30:26
    machine learning. I'm trying to show you what
    machine
  • 30:26 - 30:31
    learning is. And practical applications of
    machine learning. So
  • 30:31 - 30:36
    I have nine minutes and forty-four seconds,
    and let's...
  • 30:36 - 30:38
    go... to here.
  • 30:38 - 30:42
    So let's talk about doing machine learning
    in Ruby.
  • 30:42 - 30:47
    God. There. There is a package called AI4R.
    You
  • 30:47 - 30:50
    could gem install AI4R right now. It implements
    a
  • 30:50 - 30:54
    lot of things that I was talking about earlier.
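As a taste of what using it looks like, a minimal k-means call assuming AI4R's documented clusterer interface (the data points here are made up):

      require 'ai4r'

      data = Ai4r::Data::DataSet.new(
        data_items: [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
      )

      # Build two clusters and look at where the centroids landed.
      clusterer = Ai4r::Clusterers::KMeans.new.build(data, 2)
      p clusterer.centroids
      p clusterer.eval([2, 2])   # index of the cluster this point falls into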
  • 30:54 - 30:57
    Here's a picture of their web page. There's
    something
  • 30:57 - 31:01
    called, there's a project called SciRuby out
    there, and
  • 31:01 - 31:02
    it is a lot of scientific stuff. So a
  • 31:02 - 31:05
    lot of linear algebra things that you would
    need
  • 31:05 - 31:06
    for machine learning.
  • 31:06 - 31:09
    But the problem with SciRuby is that it's
    more
  • 31:09 - 31:11
    like a science experiment and it doesn't have
    the
  • 31:11 - 31:14
    funding it needs to be complete. So people
    are
  • 31:14 - 31:18
    using it but there's much better things out
    there.
  • 31:18 - 31:22
    You can use JRuby and Mahout. Apache Mahout
    is
  • 31:22 - 31:25
    actually one of the defacto machine learning
    packages out
  • 31:25 - 31:26
    there, but it's written in, on, in Java on
  • 31:26 - 31:29
    the JVM. If you use JRuby you can actually
  • 31:29 - 31:31
    get pretty easy access to it.
  • 31:31 - 31:33
    And because it's so popular, someone wrote
    a JRuby
  • 31:33 - 31:36
    Mahout plugin, but you can tell that they
    were
  • 31:36 - 31:39
    really interested in it, because the last
    time they
  • 31:39 - 31:41
    looked at it was eleven months ago.
  • 31:41 - 31:45
    So. What I'm gonna do now is I'm gonna
  • 31:45 - 31:49
    rail on Ruby for just a second. I love
  • 31:49 - 31:52
    Ruby. I write Ruby every day in some fashion
  • 31:52 - 31:54
    or another, and I use it for a lot.
  • 31:54 - 31:57
    But Ruby is not good for machine learning
    because
  • 31:57 - 31:59
    Ruby is not fast when it comes to math.
  • 31:59 - 32:01
    The project and some of the things inside
    of
  • 32:01 - 32:03
    there were kind of, are trying to fix it,
  • 32:03 - 32:06
    but the problem is, even getting to that stage
  • 32:06 - 32:08
    of installing this stuff is hard. Cause to
    install
  • 32:08 - 32:11
    SciRuby you need an install called ATLAS.
  • 32:11 - 32:14
    Which is damn near impossible if you have
    a
  • 32:14 - 32:14
    Mac.
  • 32:14 - 32:16
    We also don't have easy plotting in Ruby.
    You
  • 32:16 - 32:18
    notice that I used gnuplot and D3 for
  • 32:18 - 32:21
    all my, my, my plotting examples. There is
    a
  • 32:21 - 32:23
    Ruby gnuplot gem out there, and you can
  • 32:23 - 32:26
    use it. But Ruby does not have great native
  • 32:26 - 32:30
    plotting stuff like matplotlib in Python.
  • 32:30 - 32:34
    There is no integrated environment. Ruby has
    py- I
  • 32:34 - 32:36
    mean, IRB, and you, I was using Pry earlier.
  • 32:36 - 32:38
    But there's not a great, there's not like
    a
  • 32:38 - 32:43
    MATLAB or a Mathematica, or IPython for
  • 32:43 - 32:46
    Ruby. We really do need this.
  • 32:46 - 32:49
    And also, I want to be able to do
  • 32:49 - 32:51
    what I need to do as a scientist. I
  • 32:51 - 32:52
    want to be like a scientist. As a scientist,
  • 32:52 - 32:55
    I don't want to be a programmer. Ruby just
  • 32:55 - 32:58
    still does not have the maturity in the idioms
  • 32:58 - 33:01
    to allow scientists to be scientists. This
    is why
  • 33:01 - 33:04
    all the scientists either use - not matplotlib,
    MATLAB - or they
  • 33:04 - 33:08
    use
  • 33:08 - 33:11
    Python. Because it allows them to still be
    scientists
  • 33:11 - 33:13
    while letting them get stuff done.
  • 33:13 - 33:16
    So I don't want to rail on Ruby. What
  • 33:16 - 33:17
    do you want to do if you want to
  • 33:17 - 33:19
    learn more? Because really I cannot give you
    guys
  • 33:19 - 33:22
    an advance talk. I could actually talk on
    one
  • 33:22 - 33:25
    subject, like, classifiers, for this whole
    entire time. But
  • 33:25 - 33:27
    if you, I want you to learn linear algebra.
  • 33:27 - 33:28
    And you know I've probably said this three
    times
  • 33:28 - 33:31
    already. You need to learn linear algebra.
    It's not
  • 33:31 - 33:33
    hard. There's a coursera class on it - it's
  • 33:33 - 33:34
    actually, it's not hard.
  • 33:34 - 33:37
    You need to learn calculus. You notice I had
  • 33:37 - 33:39
    a minimizing function earlier. You need to
    learn how
  • 33:39 - 33:41
    to minimize. You need to know some kind of
  • 33:41 - 33:43
    statistics. My friend Randall in the back
    will tell
  • 33:43 - 33:45
    you that you need to know statistics just
    in
  • 33:45 - 33:46
    general.
  • 33:46 - 33:47
    But you need to learn some statistics. You
    at
  • 33:47 - 33:50
    least need to know how mean, min, max work,
  • 33:50 - 33:53
    how standard deviation works, and a few things
    around
  • 33:53 - 33:54
    there.
  • 33:54 - 33:58
    There's a coursera class on machine learning
    from the
  • 33:58 - 34:02
    guy at Stanford. I would say watch it. It's
  • 34:02 - 34:04
    kind of dry. But at least it'll give you
  • 34:04 - 34:06
    some of the, it'll give you some of the
  • 34:06 - 34:09
    things you need to know. My partner in the
  • 34:09 - 34:11
    back, Randall, has actually written a great
    blog post
  • 34:11 - 34:14
    on getting into machine learning that he's
    gonna post.
  • 34:14 - 34:16
    He has to now because I just said it
  • 34:16 - 34:17
    in public.
  • 34:17 - 34:19
    And another thing you want to do is use
  • 34:19 - 34:22
    Wikipedia. I tell you what. I gotta show you
  • 34:22 - 34:24
    this. Five minutes. We'll show you this.
  • 34:24 - 34:28
    So Wikipedia is amazing. Anything you want
    to know,
  • 34:28 - 34:32
    some smart dude or lady has written it on
  • 34:32 - 34:33
    Wikipedia. So if you want to look up support
  • 34:33 - 34:40
    vector machine, yes.
  • 34:41 - 34:43
    There it is.
  • 34:43 - 34:47
    Some nice person has written how it works.
    If
  • 34:47 - 34:52
    you want to look up, k-means clustering, some
    smart
  • 34:52 - 34:56
    person has, has written how it all works.
    You
  • 34:56 - 34:59
    need to- Wikipedia is better than any textbook,
    I'm
  • 34:59 - 35:00
    not lying.
  • 35:00 - 35:01
    You also need to read this book by Peter
  • 35:01 - 35:03
    Flach. There's a lot of books on machine learning
  • 35:03 - 35:05
    out there, and this one's hard to buy on
  • 35:05 - 35:07
    Amazon and it might be, like, I don't know
  • 35:07 - 35:10
    how much it is. Maybe it's fifty, sixty bucks.
  • 35:10 - 35:12
    There's not many books that I will say are great.
  • 35:12 - 35:16
    The first chapter of this book is probably
    the
  • 35:16 - 35:19
    best reference on machine learning out there.
  • 35:19 - 35:20
    And I'll, I'll wait for a second so you
  • 35:20 - 35:23
    guys can take this in. This first chapter
    of
  • 35:23 - 35:25
    this book by Peter Flach is the best thing
  • 35:25 - 35:27
    I've seen on machine learning.
  • 35:27 - 35:28
    But if you want to get serious, I want
  • 35:28 - 35:31
    you to find a data set, find another language,
  • 35:31 - 35:33
    unfortunately. If you're really serious about
    this you're not
  • 35:33 - 35:36
    gonna be doing any Ruby. Then you'll want
    to
  • 35:36 - 35:37
    do the dot dot dot dance, and then maybe
  • 35:37 - 35:38
    you'll profit.
  • 35:38 - 35:40
    But I'll tell you what. We haven't even scratched
  • 35:40 - 35:44
    the surface. Scikit, which is, scikit-learn
    which is,
  • 35:44 - 35:48
    in Python land, actually has this huge chart
    of
  • 35:48 - 35:49
    all the things you can do in machine learning,
  • 35:49 - 35:52
    we only talked about three things on this
    list,
  • 35:52 - 35:54
    and look at all these little circles and lines.
  • 35:54 - 35:55
    There's so much more you could talk about
    in
  • 35:55 - 35:57
    machine learning.
  • 35:57 - 35:59
    I want you to go look at BigML. It's
  • 35:59 - 36:02
    actually a great machine learning platform,
    and it's a
  • 36:02 - 36:05
    lot of tutorials on there. And Dundas. This
    is
  • 36:05 - 36:08
    this website. Their blog has interesting every
    once in
  • 36:08 - 36:11
    awhile. Kaggle has contests, data sets, and
    an interesting
  • 36:11 - 36:12
    blog.
  • 36:12 - 36:14
    You might want to go look in Python land.
  • 36:14 - 36:19
    NumPy, SciPy, scikit-learn and matplotlib are
    godsends. And on
  • 36:19 - 36:22
    Python look at Apache Mahout. And there's
    a newcomer
  • 36:22 - 36:24
    on the block - I'm just throwing things out
  • 36:24 - 36:29
    there for future reference. Shark with Spark.
    PredictionIO -
  • 36:29 - 36:31
    this is a newcomer. It does prediction tools.
    It's
  • 36:31 - 36:33
    actually pretty neat. I got a data set in
  • 36:33 - 36:37
    there and I was not unimpressed.
  • 36:37 - 36:38
    But the cool thing about it is that it's
  • 36:38 - 36:41
    only building on, on technologies we already
    have. I
  • 36:41 - 36:43
    mean it's not a new thing. It is using
  • 36:43 - 36:45
    Hadoop. It is using Mahout.
  • 36:45 - 36:50
    Here's some slides. Here's where the, the
    code for
  • 36:50 - 36:52
    this talk is, if you're really into that kind
  • 36:52 - 36:54
    of stuff. No one really loves this stuff,
    so
  • 36:54 - 36:56
    I'll put it up there for posterity.
  • 36:56 - 36:58
    But I want to show you something with the
  • 36:58 - 37:00
    last couple minutes I have here.
  • 37:00 - 37:03
    So this is what Ruby needs. If, If I
  • 37:03 - 37:05
    were to walk around to everyone and say, you
  • 37:05 - 37:07
    know what, I can tell you what Ruby needs.
  • 37:07 - 37:10
    Ruby needs IPython. Has anyone here seen IPython?
  • 37:10 - 37:14
    All right, I'm about to blow your guys's minds.
  • 37:14 - 37:19
    So IPython is basically Pry on a webpage.
    And
  • 37:19 - 37:24
    because I'm so awesome, I'm just gonna type
    in
  • 37:24 - 37:26
    some Python.
  • 37:26 - 37:30
    I'm just kidding.
  • 37:30 - 37:38
    I'm not gonna do that.
  • 37:38 - 37:42
    So one plus one equals two. You're actually
    writing
  • 37:42 - 37:45
    Python in your web browser here. This is,
    this
  • 37:45 - 37:49
    is a transformational tool. We could actually
    do this
  • 37:49 - 37:51
    better, and I'm not saying that we should
    just
  • 37:51 - 37:54
    go steal this, but you know what, rack got
  • 37:54 - 37:57
    us really, really far. And you nice guys know
  • 37:57 - 38:01
    where Rack came from, right? From Python.
  • 38:01 - 38:04
    Another thing I want to look at, and this'll
  • 38:04 - 38:08
    be the last thing, is - oh actually, you
  • 38:08 - 38:10
    know what, I'm done.
  • 38:10 - 38:14
    Thank you.