RubyConf 2013 - Recommendation Engines with Redis and Ruby by Evan Light

0:16 - 0:18

EVAN LIGHT: OK, so it's 2:01.
0:18 - 0:20

I guess we better get started, cause I was
told that,
0:20 - 0:22

all right I have to hit this little button,
0:22 - 0:25

that once I run out of time this little doo-hicky
here
0:25 - 0:26

is gonna make lots of noise
0:26 - 0:27

and then they're gonna bring out the,
0:27 - 0:29

the gong, and it won't be pretty.
0:29 - 0:34

So yeah, that's me. And we're, so right. I'm
0:34 - 0:36

mixed up. I'm Xavier Shay. I'm here to talk
0:36 - 0:38

about Ruby Profiling. If you were looking
for this
0:38 - 0:40

Evan Light guy, he's in that other room.
0:40 - 0:42

Oh, wait, no that's not right. Yeah, OK. We're
0:42 - 0:44

- I'm Evan Light and we're talking about recommendation
0:44 - 0:46

engines with Ruby and Redis, and why are there
0:46 - 0:50

more people in here than I expected? OK.
0:50 - 0:53

So, very briefly about me - I created and
0:53 - 0:55

run this event out in northern Virginia called
Ruby
0:55 - 0:59

DCamp. It's a three-day nerd commune in the
woods
0:59 - 1:02

for Ruby programmers. If you haven't heard
about it,
1:02 - 1:04

there are a bunch of handy, a bunch of people,
1:04 - 1:08

participants here who have been before. So,
but, in
1:08 - 1:12

a nutshell, you come out in the woods, you
1:12 - 1:13

hack on Ruby code, you hang out with awesome
1:13 - 1:15

programmers, you are not allowed to leave
until the
1:15 - 1:17

very end.
1:17 - 1:20

And the attendees decide on basically everything
and they
1:20 - 1:21

have to do all the chores. And that sounds
1:21 - 1:22

like a lot of work, but it's really an
1:22 - 1:24

awful lot of fun. Oh, and free. But you
1:24 - 1:26

have to get, you have to get a code
1:26 - 1:27

in order to be able to attend.
1:27 - 1:30

Also, I work for this little company called
Rackspace.
1:30 - 1:32

Can you guys raise your hands if you've heard
1:32 - 1:34

of us before? Oh, that's pretty good. How
many
1:34 - 1:38

of you guys use us? Or, well, I guess
1:38 - 1:43

I'll say currently use us. Hmm. That's not
too
1:43 - 1:45

many. We need to work on that some.
1:45 - 1:47

So I'm a, a what they call developer advocate
1:47 - 1:50

for Rackspace. That is, that I'm here for
you
1:50 - 1:53

guys. Truly. And that's why I took the job.
1:53 - 1:54

I wanted the job where I could do more
1:54 - 1:57

for the Ruby community, and they basically
said, great,
1:57 - 1:59

that's the kind of person we want.
1:59 - 2:01

So if there's anything I can do to make
2:01 - 2:03

your lives, those few of you here, we need
2:03 - 2:06

more, who use Rackspace, make your lives better
with
2:06 - 2:09

Rackspace, great. And for those of you who
don't,
2:09 - 2:11

if there's anything you can think of that
would
2:11 - 2:13

make you want to - yeah, we'd like to
2:13 - 2:14

hear that, too.
2:14 - 2:17

Let's see. So moving right along. In a nutshell,
2:17 - 2:20

here's what we're gonna talk about. This is
a
2:20 - 2:25

case study of sorts for which I, a client
2:25 - 2:26

whose problem I solved with a recommendation
engine, we'll
2:26 - 2:29

talk about that. So we'll talk about the context,
2:29 - 2:31

the solution that I used - I need to
2:31 - 2:35

not look at my phone. Some Redis-related tangents,
because
2:35 - 2:37

this is really all about Redis and Ruby,
2:37 - 2:40

and some painful lessons I learned along the
way.
2:40 - 2:43

So the context. The client of mine, who shall
2:43 - 2:45

remain nameless, just so that way I can be
2:45 - 2:49

a little freer with discussion. They had a
soccer,
2:49 - 2:51

or have, I should say, a soccer social network.
2:51 - 2:53

So imagine Facebook but for soccer.
2:53 - 2:56

Like, Facebook for blah blah blah, that's
pretty common
2:56 - 2:59

in California, right. But in their case, what
made
2:59 - 3:01

them really interesting is that they have
a live
3:01 - 3:04

feed of soccer data coming in all the time.
3:04 - 3:06

So, as games are being played, every time
there's
3:06 - 3:08

a red card or a yellow card, or someone
3:08 - 3:10

scores a goal or there's a penalty, they get
3:10 - 3:13

a notification about it.
3:13 - 3:14

And what they wanted to be able to do
3:14 - 3:16

is they wanted their users to be able to
3:16 - 3:21

see popular events, popular posts on their
site, so
3:21 - 3:25

the, the soccer event feed, as it's coming
in,
3:25 - 3:28

would be automatically spewed out into the
website as
3:28 - 3:31

a series of posts. And they would be contextualized,
3:31 - 3:33

that is, that they would have tags in them,
we'll
3:33 - 3:35

see more on that later.
3:35 - 3:37

So they wanted the, the users to be able
3:37 - 3:41

to see popular posts and relevant posts. And
in
3:41 - 3:44

near real-time, and that near real-time
part means
3:44 - 3:46

that it's a little bit more exciting.
3:46 - 3:50

So recommendation engines - I'm sure that
most of
3:50 - 3:51

you are at least familiar with the idea, because
3:51 - 3:54

you use this thing called Google, probably.
Maybe you've
3:54 - 3:59

heard of it. So recommendation engines are
an approximation,
3:59 - 4:02

and they are based on, obviously, large sets
of
4:02 - 4:07

data, ideally. And in this case, we want two
4:07 - 4:09

different kinds of recommendations.
4:09 - 4:13

Again, we want what's popular - that's pretty
straightforward.
4:13 - 4:17

But what's relevant - and that's very subjective.
4:17 - 4:19

So they're based in statistics. But this is
me
4:19 - 4:23

and statistics. And, and, and, and this to
me
4:23 - 4:25

is actually what makes this talk interesting,
because I
4:25 - 4:31

built a recommendation engine being that dog.
4:31 - 4:34

So statistics - so recommendation engines
are canonically based
4:34 - 4:38

in the statistical methods and, yeah. Statistical
methods and
4:38 - 4:40

I, we don't get along so great. So this
4:40 - 4:41

is basically about how you do it with brute
4:41 - 4:44

force and still get away with it.
4:44 - 4:46

So other than being ignorant to statistical
methods, quite
4:46 - 4:48

frankly, I couldn't get the client to pay
me
4:48 - 4:50

for a day or two of research. I asked
4:50 - 4:52

them - I said, wouldn't you like to do
4:52 - 4:55

the, the right thing rather than just, just,
than
4:55 - 4:59

something probably ugly that'll work? And
they said, no
4:59 - 5:02

basically we trust you, so just go build it.
5:02 - 5:04

But I'm telling you, it'd be better if I
5:04 - 5:05

did a little research in advance.
5:05 - 5:07

No, no - just go build it.
5:07 - 5:08

OK.
5:08 - 5:11

Cause I like being paid.
5:11 - 5:13

So, why Ruby?
5:13 - 5:17

Well, kind of the same thing there. Their
developers
5:17 - 5:20

knew Ruby. They knew JavaScript. I said maybe
we
5:20 - 5:23

should use something faster, you know, Java
- which,
5:23 - 5:25

I feel is really funny to say, having been
5:25 - 5:26

a programmer for awhile. If you said Java
was
5:26 - 5:29

fast twenty years ago, you'd, I would, or
if
5:29 - 5:30

I'd said it, I'd be laughed out of the
5:30 - 5:31

room.
5:31 - 5:33

Nowadays, you have Java - fast. Go - fast.
5:33 - 5:36

C - fast. Even, JVM languages. I said Clojure
5:36 - 5:40

because I like Clojure. But, nope. They wanted
Ruby.
5:40 - 5:44

So, OK, Ruby it is. No statistical methods,
really,
5:44 - 5:47

fine. I'll figure something out.
5:47 - 5:49

So let's talk about the system a little bit.
5:49 - 5:52

Like, every social network, it has the typical
nouns
5:52 - 5:56

of users, posts, comments. You're used to
this. But
5:56 - 5:57

then we have a few new ones. We have
5:57 - 6:01

teams, players. I forget, I think they, they
had
6:01 - 6:03

a match as a noun, but really a match
6:03 - 6:06

to me was just two teams playing. An event
6:06 - 6:08

with two teams on it.
6:08 - 6:11

And then we had a series of verbs. So
6:11 - 6:13

submitting a post - I'm sure you're familiar
with
6:13 - 6:16

that. Except that I alluded to this a little
6:16 - 6:20

bit earlier - posts have tags. They're taggable
polymorphically.
6:20 - 6:23

So you could put any old thing on them,
6:23 - 6:25

but usually you would see teams and players,
and
6:25 - 6:27

that's really all they wanted out of the recommendation
6:27 - 6:27

engine.
6:27 - 6:30

It's important to mention. More on that later.
6:30 - 6:32

And it's not that import - it's really not
6:32 - 6:34

that important, it's just a fun point later.
6:34 - 6:36

So you can comment on a post - big
6:36 - 6:38

surprise. Again, social network, you probably
didn't expect to
6:38 - 6:41

see that. But you can tag posts, you can
6:41 - 6:44

tag, sorry, comments, with users - kind of
like
6:44 - 6:47

you could in Facebook. It was a little bit
6:47 - 6:50

more of a nuisance because they didn't have
a,
6:50 - 6:53

a tagging mechanism per se for users, like
Facebook
6:53 - 6:55

does. I just had to write something to scan
6:55 - 6:58

the text. Not entirely relevant to the rest
of
6:58 - 7:00

the discussion, so. We'll just keep on going.
7:00 - 7:03

Other verbs that kind of mattered a bit -
7:03 - 7:06

favoriting teams or players. This isn't something
that, that
7:06 - 7:08

Facebook had. More like a Foursquare thing,
when you
7:08 - 7:10

say, I love this. I love this team. I
7:10 - 7:14

love this player. And then liking posts. Pretty
typical
7:14 - 7:15

stuff.
7:15 - 7:19

So given a, a model that looks a- something
7:19 - 7:22

a little like this, leaving out comments and
likes
7:22 - 7:24

and favorites for now. Let's say you have
a
7:24 - 7:28

user, in this case he's user 2, he has,
7:28 - 7:30

he posted three posts. The first two posts
are
7:30 - 7:32

really the important ones, and the first two
posts
7:32 - 7:34

talk about tags one, two, and three.
7:34 - 7:37

So say we're given this. And maybe we have
7:37 - 7:41

something like this, but we don't initially,
where we
7:42 - 7:43

can say this other - we have this other
7:43 - 7:47

guy and he's interested in these tags. So
those
7:47 - 7:49

might be teams or players.
7:49 - 7:53

And they have these scalar values associated,
this user
7:53 - 7:55

has these scalar values associated with each
of these,
7:55 - 7:58

say, teams or players. So given those things,
what
7:58 - 8:00

we want, ultimately, is that.
8:00 - 8:02

We want to be able to say, this user
8:02 - 8:05

is going to be interested in these posts and
8:05 - 8:08

not that post. So post three and - going
8:08 - 8:13

back two slides - had tag four. And post,
8:13 - 8:15

and user 1 only cared about tags one, two,
8:15 - 8:17

and three. Not tag four.
8:17 - 8:19

So he's only interes- he should only have
a
8:19 - 8:21

score for post one, post two, and he shouldn't
8:21 - 8:23

have anything for post three because he just
doesn't
8:23 - 8:27

care. So we want something like that.
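A minimal sketch of the result described above, assuming the user's tag interests are plain numbers and a post's score for a user is the sum of that user's interest in the post's tags. The summing rule, the sample values, and the names `TAG_INTEREST`, `POST_TAGS`, and `post_scores` are illustrative assumptions, not the client's code.

```ruby
# User 1's interest per tag (hypothetical scalar values).
TAG_INTEREST = { "tag1" => 3.0, "tag2" => 1.5, "tag3" => 0.5 }.freeze

# Tags on each post: posts 1 and 2 cover tags 1 to 3, post 3 only has tag 4.
POST_TAGS = {
  "post1" => %w[tag1 tag2],
  "post2" => %w[tag2 tag3],
  "post3" => %w[tag4]
}.freeze

# Scores each post by summing the user's interest in its tags (an assumed rule).
# Posts that share no tag with the user's interests get no entry at all.
def post_scores(tag_interest, post_tags)
  post_tags.each_with_object({}) do |(post, tags), scores|
    relevant = tags.select { |tag| tag_interest.key?(tag) }
    next if relevant.empty?

    scores[post] = relevant.sum { |tag| tag_interest[tag] }
  end
end

scores = post_scores(TAG_INTEREST, POST_TAGS)
```

Here `scores` has entries for post 1 and post 2 only; post 3 is skipped because the user has no interest in tag 4.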
8:27 - 8:30

So this part here is where the, the interesting-ness
8:30 - 8:32

came in. When the client approached me, they
said,
8:32 - 8:35

well we have this idea for how a recommendation
8:35 - 8:37

engine would work. We'll just, we'll just
have a
8:37 - 8:40

weight associated with each one of these events
as
8:40 - 8:41

they occur.
8:41 - 8:44

Well, that's all well and good, but going
from
8:44 - 8:47

the first diagram with the post and the tags
8:47 - 8:49

to, oh, I have this in a single step
8:49 - 8:51

doesn't really make any sense.
8:51 - 8:53

I needed some kind of lens in order to,
8:53 - 8:56

to figure out what the user, what content
the
8:56 - 9:00

user would actually care about. So I needed
intermediate
9:00 - 9:01

value - I needed some intermediate values
to get
9:02 - 9:05

a sense of what does the user care about?
9:05 - 9:10

So moving on. We start with ActiveRecord.
Every good
9:10 - 9:13

application does - not really. But it was
a
9:13 - 9:16

Rails app, so yeah, we had ActiveRecord. But
really
9:16 - 9:21

we're talking about ActiveRecord::Observers.
So that's to say that
9:21 - 9:26

we would capture, or would capture the lifecycle
events
9:26 - 9:30

of the nouns I described earlier. And, well,
we
9:30 - 9:32

would have some data that we would feed into
9:32 - 9:35

something and we'll get there in just a minute.
9:35 - 9:37

So to reiterate, we cared about two different
kinds
9:37 - 9:40

of posts. Really, they're, well, posts are
posts. But
9:40 - 9:43

we care about quantifying them in two different
buckets.
9:43 - 9:46

Popular, which is a global thing, and relevant,
which
9:46 - 9:48

is subjective to the individual user.
9:48 - 9:51

So popularity is pretty straightforward. It,
it could be
9:51 - 9:54

made a little more complex, in this. Popularity,
if
9:54 - 9:59

I recall, is based on comments and likes.
And
9:59 - 10:03

I forget which was worth more. Because we,
we
10:03 - 10:04

would - and that's kind of irrelevant. The
point
10:04 - 10:06

is, they would have different weightings.
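A minimal sketch of popularity as a weighted tally, assuming each comment or like adds a fixed weight to a post's running score. The weights and the names `POPULARITY_WEIGHTS` and `bump_popularity` are hypothetical; the talk does not say which action was worth more.

```ruby
# Hypothetical weights per action.
POPULARITY_WEIGHTS = { "comment" => 2, "like" => 1 }.freeze

# Adds the weight of one action to a post's running popularity score.
# Actions without a configured weight contribute nothing.
def bump_popularity(scores, post_id, action)
  scores[post_id] = scores.fetch(post_id, 0) + POPULARITY_WEIGHTS.fetch(action, 0)
  scores
end

scores = {}
bump_popularity(scores, 42, "like")
bump_popularity(scores, 42, "comment")
bump_popularity(scores, 7, "like")
```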
10:06 - 10:08

So from a trendingness standpoint, a comment
might
10:08 - 10:10

be worth more than a like or a like might
10:10 - 10:12

be worth more than a comment. One thing that
10:12 - 10:13

we had talked about doing that, if we had
10:13 - 10:15

done would have made life a lot more interesting,
10:15 - 10:17

is to have a notion of taste makers. And
10:17 - 10:20

that is, people who are super jazzed about
a
10:20 - 10:24

topic having their, their likes and their
comments being
10:24 - 10:29

more valuable in terms of popularity than
other people's.
10:29 - 10:32

If you instantly start thinking about gaming
the system
10:32 - 10:33

when I say something like that, then you're
basically
10:33 - 10:35

reading my mind. Because I kept going back
to
10:35 - 10:37

the client over and over again about that,
and
10:37 - 10:40

their response was, oh to have such problems.
And,
10:40 - 10:41

well I had to agree with them.
10:41 - 10:43

If someone games your system, well then you're
doing
10:43 - 10:45

pretty well for someone to care enough to
do
10:45 - 10:46

it.
10:46 - 10:48

Relevance is really where it gets a little
more
10:48 - 10:51

interesting, or a lot more interesting. So
we have
10:51 - 10:53

these verbs, or I guess these statements,
like if
10:53 - 10:55

you go and favorite DC United, or you submit
10:55 - 10:58

a post tagged with DC United - let's say
10:58 - 11:00

you like DC United, or you comment on a
11:00 - 11:03

post that is tagged with DC United, or the
11:03 - 11:06

really confusing one, you're mentioned in
a comment on
11:06 - 11:08

a post tagged DC United.
11:08 - 11:09

If your head hurts on that one, I understand.
11:09 - 11:12

It took me awhile to wrap my brain around
11:12 - 11:12

it too.
11:12 - 11:13

So obviously, if you hadn't figured out, there
was
11:13 - 11:16

a time in my life when I liked DC
11:16 - 11:20

United. But I'm not really much into sports
anymore.
11:20 - 11:25

But moving right along. So, relevance is,
in this
11:25 - 11:27

system, is defined by an algorithm kind of
like
11:27 - 11:28

this.
11:28 - 11:32

So given an arbitrary event defined by an
AR
11:32 - 11:36

observer, or essentially serialized by an
AR observer, for
11:36 - 11:40

each tag on that event, for each user interested
11:40 - 11:44

in that tag, go score the user's interest
in
11:44 - 11:47

that tag, or go rescore assuming that there's
an
11:47 - 11:49

interest already.
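The loop described above can be sketched roughly as follows. This only shows the nested shape (tags on the event, then users interested in each tag); the per-pair update, which here just adds a fixed weight, is a placeholder assumption, as are the data layout and the name `rescore_for_event`.

```ruby
# event:            { tags: [tag, ...] }
# interested_users: { tag => [user_id, ...] }
# interests:        { user_id => { tag => score } }
def rescore_for_event(event, interested_users, interests, weight)
  event[:tags].each do |tag|                            # for each tag on the event
    interested_users.fetch(tag, []).each do |user_id|   # for each user interested in it
      interests[user_id] ||= {}
      # Placeholder rescoring rule: add a fixed weight to the existing interest.
      interests[user_id][tag] = interests[user_id].fetch(tag, 0.0) + weight
    end
  end
  interests
end

event = { tags: ["dc-united", "player-9"] }
interested_users = { "dc-united" => [1, 2], "player-9" => [2] }
interests = { 1 => { "dc-united" => 1.0 } }

rescore_for_event(event, interested_users, interests, 0.5)
```

The inner loop runs once per (tag, interested user) pair, which is where the nested cost comes from.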
11:49 - 11:50

So if you hadn't figured out already, that's
a
11:50 - 11:52

Big O of N squared algorithm, if you're
in computer
11:52 - 11:55

science. And that's a bad, bad, bad thing.
Damn
11:55 - 11:57

- I was hoping there might have been more
11:57 - 12:02

Pacific Rim fans in the audience, but. Oh
well.
12:02 - 12:06

So yeah, Big O N squared algorithm. I'm up
12:06 - 12:09

against it, so I'm thinking there's - this,
this,
12:09 - 12:11

this is bad. What am I gonna do with
12:11 - 12:15

this situation? Well, how could we cheat?
12:15 - 12:18

So, it occurred to me, we're talking about
soccer
12:18 - 12:22

matches, about sports games. We're talking
about wanting timely
12:22 - 12:25

recommendations. Why do we care about stuff
that's in
12:25 - 12:27

the past? We shouldn't. So I went to the
12:27 - 12:29

client and said, what if we just say, have
12:29 - 12:31

a window of three days and then after that
12:31 - 12:32

we just don't care anymore?
12:32 - 12:34

And they said thumbs up, and I thought, oh
12:34 - 12:36

great, now there's a whole lot of data I
12:36 - 12:38

don't have to worry about. So, Big O N
12:38 - 12:41

squared is bad, but N just got a whole
12:41 - 12:43

lot smaller.
12:43 - 12:45

By the way, sorry, computer science parlance.
Big O
12:45 - 12:47

of N squared is to say it's a nested
12:47 - 12:51

loop, and N is some arbitrarily large constant.
It's
12:51 - 12:54

the largest, if, I think, we're being concrete
about
12:54 - 12:56

it, the size of the largest data structure
would
12:56 - 12:59

be iterating over. So the big O is worst
12:59 - 13:03

case run time of this algorithm would be looping
13:03 - 13:07

over the longest structure in a nested fashion.
And
13:07 - 13:09

that's generally very slow - you don't want
to
13:09 - 13:11

do that.
13:11 - 13:14

So we only care about recent posts, as I
13:14 - 13:17

said a moment ago. But now we, we've narrowed
13:17 - 13:20

down what events we care about. We need some
13:20 - 13:22

kind of event consumer. So how many of you
13:22 - 13:24

are familiar with Resque?
13:24 - 13:28

Hmm, OK. About half. I wasn't sure what to
13:28 - 13:29

expect. Interesting.
13:29 - 13:32

So Resque is a, a queuing system for processing
13:32 - 13:36

background tasks, and it's written using this
thing called
13:36 - 13:38

Redis, which was in the talks. I assume you
13:38 - 13:40

might be vaguely interested in this thing
called Redis.
13:40 - 13:43

How many people know about Redis, are kind
of
13:43 - 13:44

comfortable with it?
13:44 - 13:46

OK, that's a little more than half. Pretty
good.
13:46 - 13:47

So I'll keep this short.
13:47 - 13:51

Again, don't time me. So Redis is a key
13:51 - 13:53

value store, which is to say it's a little
13:53 - 13:54

bit like memcached, or if you just
13:54 - 13:57

want to speak more Ruby parlance, it's basically
like
13:57 - 14:00

a glorified hash, except it runs as a server,
14:00 - 14:03

as a daemon process basically.
14:03 - 14:06

It lives, or, its storage is in-memory, but
it
14:06 - 14:08

can persist to disk, and there are a couple
14:08 - 14:10

of different persistence options that give
you a little
14:10 - 14:15

bit of flexibility about how often, how reliably
it
14:15 - 14:17

persists.
14:17 - 14:19

And the interesting thing about Redis is it's
not
14:19 - 14:21

just a straight key-value storage, it's not
just a
14:21 - 14:23

hash, or I guess you could say it is
14:23 - 14:24

a lot like a Ruby hash in some ways,
14:24 - 14:27

because the value doesn't have to just be
a
14:27 - 14:28

string. The value could be some kind of data
14:28 - 14:32

structure. And Redis supports, well, the ones
listed here.
14:32 - 14:37

So lists, so lists allow repetition. And they're
sorted.
14:37 - 14:39

Well, they're sorted based on insertion order,
I should
14:39 - 14:40

say.
14:40 - 14:42

A hash, as you might expect, so actually,
so
14:42 - 14:45

key-value is by virt- by nature a lot like
14:45 - 14:47

a hash, so basically you can have hashes in
14:47 - 14:50

your hashes. You don't necessarily want to
use those,
14:50 - 14:52

and we'll talk about that soon.
14:52 - 14:56

Sets. So a list where the insertion order
doesn't
14:56 - 14:59

necessarily matter, but no repetition is allowed,
and sorted
14:59 - 15:01

sets, which are pretty darn interesting because
they don't
15:01 - 15:04

allow repetition and they maintain a sorting
order, and
15:04 - 15:08

you're, so you're inserting a value and some
sortable
15:08 - 15:10

value to go with it.
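For readers more at home in Ruby than Redis, the four structures described above have rough in-process analogues. These are plain Ruby objects, not Redis calls, and the sorted set is only emulated with a Hash that is sorted by score on read; the sample members and scores are made up.

```ruby
require "set"

# List: keeps insertion order and allows repeats.
list = []
list << "goal" << "red card" << "goal"

# Hash: field/value pairs stored under one top-level key.
hash = { "dc-united" => 4.5, "player-9" => 1.0 }

# Set: no repeats; order carries no meaning.
set = Set.new
set << 1 << 2 << 1

# Sorted set: unique members, each paired with a sortable score.
sorted_set = { "post:1" => 10.0, "post:2" => 25.0 }
sorted_set["post:1"] = 30.0 # re-adding a member replaces its score
top_first = sorted_set.sort_by { |_member, score| -score }.map(&:first)
```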
15:10 - 15:12

Well, and again, more on that later.
15:12 - 15:14

Maybe one of the most interesting parts to
me
15:14 - 15:17

about Redis is that it supports adding a time
15:17 - 15:21

to live, an arbitrary time to live, user-definable,
to
15:21 - 15:24

any given key that you put in Redis. Now
15:24 - 15:27

when I say key, I need to be very
15:27 - 15:29

specific. Key, at the macro level of key-value
for
15:29 - 15:32

Redis. So if you store a hash, a hash
15:32 - 15:35

has a single key that refers to the whole
15:35 - 15:36

hash.
15:36 - 15:38

If you're storing a list or a set or
15:38 - 15:40

a sorted set, there is one key that points
15:40 - 15:42

to the whole thing. So you put a TTL
15:42 - 15:44

on that, and what that says is, I want
15:44 - 15:46

this value to just go away after this amount
15:46 - 15:49

of time. That can be pretty handy. So when
15:49 - 15:52

I mentioned that three day window earlier,
the TTL
15:52 - 15:55

is very handy there.
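A minimal sketch of the three-day window as a TTL, assuming the redis gem's `Redis#setex(key, ttl_seconds, value)`, which writes a value and its expiry in one call. The helper name `store_with_window` and the constant `THREE_DAYS` are hypothetical; the helper takes the client as an argument, so nothing here opens a connection.

```ruby
THREE_DAYS = 3 * 24 * 60 * 60 # TTL in seconds

# Stores a value under one top-level key and lets Redis drop it after three days.
# `redis` is any client that responds to `setex(key, ttl_seconds, value)`,
# such as an instance of the redis gem's `Redis` class.
def store_with_window(redis, key, value)
  redis.setex(key, THREE_DAYS, value)
end
```

The TTL applies to the whole top-level key, so a list, set, or hash stored under that key expires as one unit.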
15:55 - 15:58

That's just a little too big font, font-wise.
So
15:58 - 16:02

the AR::Observers were pushing events out
to Resque. And
16:02 - 16:04

the event would look something like this.
It'd be
16:04 - 16:07

pushing JSON up. So the event would have the
16:07 - 16:12

type, the noun, essentially, the action - I
think
16:12 - 16:16

we were only concerned with creates, and occasionally
deletes.
16:16 - 16:19

But we didn't really care about updates.
16:19 - 16:21

I offered to add that. It wasn't, this was
16:21 - 16:23

a 1.0 release, it just wasn't something
that mattered
16:23 - 16:24

that much at the time.
16:24 - 16:26

Then we would have the ID of whatever the
16:26 - 16:28

thing was, the user ID, because that very
much
16:28 - 16:31

matters here since we're talking about the
user's interest
16:31 - 16:34

in things. And then the names of the tags
16:34 - 16:35

associated.
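An event of the shape just described might be serialized like this. The field names (`type`, `action`, `id`, `user_id`, `tags`) and the helper `build_event` are guesses for illustration; the talk lists what the payload contained, not its exact keys.

```ruby
require "json"

# Builds the JSON payload for one lifecycle event.
def build_event(type:, action:, id:, user_id:, tags:)
  JSON.generate(
    "type" => type,       # the noun, e.g. "Post" or "Comment"
    "action" => action,   # "create" or, occasionally, "delete"
    "id" => id,           # ID of the record itself
    "user_id" => user_id, # whose interest this event affects
    "tags" => tags        # names of the associated tags
  )
end

payload = build_event(type: "Post", action: "create", id: 42, user_id: 1, tags: ["DC United"])
event = JSON.parse(payload)
```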
16:35 - 16:38

But, we have all this stuff queued up, but
16:38 - 16:40

one does not simply share the load. We have
16:40 - 16:44

to define our workers. So the worker that
I,
16:44 - 16:47

I created, I called a calculator because I
figured
16:47 - 16:52

we're calculating a score, and the calculator
originally was
16:52 - 16:55

just one giant class. And it was awful.
16:55 - 16:58

So TDD very quickly showed me how bad
16:58 - 17:00

of an idea this was, as my tests grew
17:00 - 17:02

to be more and more hard. So then I
17:02 - 17:05

started to break it out into three different
kinds
17:05 - 17:10

of calculators that formed a sort of workflow.
And
17:10 - 17:13

also I, I learned through more TDD suffering
that
17:13 - 17:16

I shouldn't even have my calculate, individual
calculators think
17:16 - 17:20

about persistence, because then that made
their already busy
17:20 - 17:22

life of trying to compute things even busier
by
17:22 - 17:24

trying to worry about, well, where do I put
17:24 - 17:25

this stuff when I'm done.
17:25 - 17:29

So instead I just had the outer level calculator
17:29 - 17:32

act as a sort of strategy, I guess, in
17:32 - 17:35

the object-oriented sense. And, so he was
the Resque
17:35 - 17:37

worker, and he handled all the persistence,
and he
17:37 - 17:39

just directed the other guys to do work. He
17:39 - 17:41

would call one guy, get his output, pass it
17:41 - 17:43

on to the other and so forth.
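A rough sketch of that arrangement, assuming the outer calculator is handed its steps and a store. Each step only computes; the coordinator is the only object that saves anything. The class names, the `save` interface, and the toy step logic are all hypothetical, and the Resque wiring is left out.

```ruby
# Coordinator: runs each calculator in order, feeds each one the previous
# output, and is the only object that talks to the store.
class Calculator
  def initialize(steps, store)
    @steps = steps # { name => callable }, run in insertion order
    @store = store # anything that responds to save(name, value)
  end

  def process(event)
    @steps.reduce(event) do |input, (name, step)|
      output = step.call(input)
      @store.save(name, output)
      output
    end
  end
end

# In-memory stand-in for the persistence abstraction.
class MemoryStore
  attr_reader :saved

  def initialize
    @saved = {}
  end

  def save(name, value)
    @saved[name] = value
  end
end

store = MemoryStore.new
steps = {
  trendingness: ->(event) { event.merge(trend: 1) },
  user_interest: ->(event) { event.merge(interest: 0.5) },
  post_score: ->(event) { event.merge(post_score: event[:trend] + event[:interest]) }
}
result = Calculator.new(steps, store).process({ action: "like", post_id: 42 })
```

Because the steps never see the store, each one can be tested with plain inputs and outputs, and the store can be swapped without touching them.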
17:43 - 17:45

So persistence was handled by Redis, but I
created
17:45 - 17:50

a very simple abstraction around it, just
a class,
17:50 - 17:54

so that way the customer could decide later,
oh
17:54 - 17:56

well, storing everything in memory is kind
of sucking,
17:56 - 17:58

so Redis is costing us hundreds of dollars
now.
17:58 - 18:02

A month, or more. Because Redis again is all
18:02 - 18:03

memory, and memory gets a lot more expensive
when
18:03 - 18:05

you start getting bigger and bigger and bigger
chunks
18:05 - 18:08

of RAM. So I thought at some point they
18:08 - 18:12

might want something like, dare I say, MongoDB
-
18:12 - 18:16

not a big fan, but. Something like that, maybe.
18:16 - 18:18

So I put that there. It wasn't something I
18:18 - 18:20

really had to worry about too much while I
18:20 - 18:22

was working with them again for 1.0 version,
but
18:22 - 18:24

it seemed like an easy win.
18:24 - 18:29

So getting into the individual calculators.
The trendingness calculator,
18:29 - 18:32

just like the, the, my discussion about popularity
earlier,
18:32 - 18:34

this guy was really straightforward. You like
something. That
18:34 - 18:37

bumps up the score on a post. You comment
18:37 - 18:38

on something, that bumps up the score on a
18:38 - 18:40

post. Really dumb.
18:40 - 18:42

And then it outputs, so it would get the
18:42 - 18:44

event, it would output a new score for that
18:44 - 18:46

individual post.
18:46 - 18:48

The way that data was stored was just as
18:48 - 18:51

a simple key-value pair in Redis. So you would
18:51 - 18:55

have, and this was actually a, I guess as
18:55 - 18:57

a brief aside, this was a little uncommon
for
18:57 - 18:58

me. I was trying to find lots of ways
18:58 - 19:00

to use Redis data structures, and for whatever
reason
19:00 - 19:02

this made more sense to me as a key-value
19:02 - 19:03

pair.
19:03 - 19:05

As it turned out, it probably would have been
19:05 - 19:08

better as something else. But that's in the
lessons
19:08 - 19:09

learned section.
19:09 - 19:12

So I would munge the keys so I could,
19:12 - 19:14

you know, name space the values I was storing.
19:14 - 19:16

Because if I just had the post ID in
19:16 - 19:17

Redis, then I would have a key of forty-two
19:17 - 19:21

and, well, if that's the, if that's the post
19:21 - 19:22

ID, if I wanted to store anything else for
19:22 - 19:24

that key, well, I would overwrite whatever
was there
19:24 - 19:26

and that would suck.
19:26 - 19:28

So I would put something up front, like say,
19:28 - 19:31

trend for trendingness. That's pretty common
in key-value stores
19:31 - 19:35

to have long munged names sometimes just for
namespacing
19:35 - 19:38

purposes.
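A small sketch of that kind of key munging. The colon separator is a common Redis convention rather than something stated in the talk, and the helper name `namespaced_key` and the `ui` prefix are hypothetical.

```ruby
# Prefixes an ID with a namespace so different kinds of values
# stored for the same ID never share a key.
def namespaced_key(namespace, id)
  "#{namespace}:#{id}"
end

trend_key = namespaced_key("trend", 42)
interest_key = namespaced_key("ui", 42)
```

Both keys refer to ID 42, but they no longer collide.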
19:38 - 19:41

So let's see, right. In the key-value, the
trendingness
19:41 - 19:43

scores had the three day TTL that I talked
19:43 - 19:46

about earlier. The one part that I regretted
here
19:46 - 19:49

was that these values were sorted in Ruby
at
19:49 - 19:52

run-time, when trendingness was requested.
Now remember we're only
19:52 - 19:54

talking about three days worth of posts.
19:54 - 19:56

And this was for a fairly new social network.
19:56 - 19:58

So, again, going back to the remark I made
19:58 - 20:00

about gaming, you know, oh to have such problems,
20:00 - 20:04

where sorting in Ruby would be that painful.
But
20:04 - 20:07

I would far rather sort in, say, something like,
like,
20:07 - 20:09

C or Java, which is like 100 times faster,
20:09 - 20:12

so that sorting wouldn't be as painful as
soon,
20:12 - 20:14

but alas.
20:14 - 20:16

So the user interest calculator. This is where
we
20:16 - 20:18

start getting into that, that relevance business.
Deciding which
20:18 - 20:22

users care about what. So it would get the
20:22 - 20:26

event, and but it's important to mention that
on
20:26 - 20:28

a, for a given event there might be multiple
20:28 - 20:31

users that might care about the event. And
the
20:31 - 20:33

reason for that is because, you have the person
20:33 - 20:35

who posted the original post, but then you
have
20:35 - 20:37

all the commenters, you have all the likers.
20:37 - 20:39

So you have to aggregate all of those people
20:39 - 20:43

together because if anything else happens
in this event,
20:43 - 20:48

these people have expressed some degree of
interest in
20:48 - 20:50

the tags that are involved. I don't think
I
20:50 - 20:52

have a slide for this - so I wish
20:52 - 20:54

I had, I'll take a brief aside to mention
20:54 - 20:57

that every single one of those verbs had a
20:57 - 20:59

weighting factor associated with it.
20:59 - 21:01

So I'm just computing scalars here. I did have
21:01 - 21:05

an AI class twenty years ago, back in college.
21:05 - 21:08

So I learned a little, little bit.
21:08 - 21:11

So each one of those events would have some
21:11 - 21:13

kind of weighting associated with it, so when
we
21:13 - 21:14

had a scalar value we would know it was
21:14 - 21:17

based primarily on this, and a little bit
of
21:17 - 21:18

that, and a little bit less of this and
21:18 - 21:20

a little bit less of that, like, when you
21:20 - 21:22

favorite something, that's a big, large declaration
to say,
21:22 - 21:26

I love this! When you comment on something
-
21:26 - 21:29

well, maybe I kind of like it, and if
21:29 - 21:32

you are tagged on a comment belonging to a
21:32 - 21:35

post - eh, OK, that's a pretty weak attachment
21:35 - 21:37

but that connotes some degree of interest,
cause you're
21:37 - 21:40

associated with someone who cares about something
else.
21:40 - 21:42

So that's a very weak association, but it
is
21:42 - 21:43

some form of association.
21:43 - 21:46

So all of those users needed to have their
21:46 - 21:50

interests rescored. Right, and I just mentioned
arbitrarily assign
21:50 - 21:51

the weights for event types, so that was all
21:51 - 21:53

I had, that one bullet. So I think the
21:53 - 21:56

aside was worth it.
21:56 - 21:59

So this is how we get this structure, that
21:59 - 22:01

we have a user and we have a score
22:01 - 22:05

for each tag based on that big nasty big
22:05 - 22:08

O n squared algorithm that we defined earlier.
22:08 - 22:10

So internally in Redis this is how I would
22:10 - 22:14

store it. I would have one hash per user,
22:14 - 22:19

and the field, so that the key would be,
22:19 - 22:22

something like UI - the User ID, the user
22:22 - 22:23

interest. It's something that we would look
up an
22:23 - 22:26

awful lot, so having a nice short key seemed
22:26 - 22:29

important. Having it munged kind of essential
because, again,
22:29 - 22:32

we don't want to step on User ID values
22:32 - 22:35

with something else later.
22:35 - 22:39

I think Redis calls the individual keys in
a
22:39 - 22:41

hash fields - I don't remember if I have
22:41 - 22:44

my Redis nomenclature right. But the field
names were
22:44 - 22:47

just the tag names, and then the values were
22:47 - 22:49

the scalar interests.
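A minimal sketch of that layout: one Redis hash per user, one field per tag, a scalar as the value. It assumes the redis gem's `Redis#hincrbyfloat(key, field, increment)`; whether the original code incremented in place or wrote whole values is not stated, and the `ui:<user id>` key format and the helper name `bump_interest` are assumptions.

```ruby
# Adds `weight` to a user's interest in each of the given tags.
# `redis` is any client that responds to `hincrbyfloat(key, field, increment)`,
# such as an instance of the redis gem's `Redis` class.
def bump_interest(redis, user_id, tags, weight)
  key = "ui:#{user_id}" # hypothetical key format: short prefix plus user ID
  tags.each { |tag| redis.hincrbyfloat(key, tag, weight) }
  key
end
```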
22:49 - 22:50

And intentionally I did not want to put any
22:50 - 22:53

kind of time to live on that hash, because
22:53 - 22:55

users' interests are one thing we know are
gonna
22:55 - 22:57

live on and on and on and on.
22:57 - 22:59

Downside is the user interests, the users'
interests are
22:59 - 23:01

something that will live on and on and on
23:01 - 23:02

and on. So you know that they're just gonna
23:02 - 23:05

take up more and more space. Tags are not
23:05 - 23:08

something that, that leave the system very
often, because
23:08 - 23:11

players tend to play for awhile, and even
if
23:11 - 23:13

they retire they might get mentioned again
in the
23:13 - 23:17

social network, so I don't know that the players
23:17 - 23:19

are gonna leave the system often. Teams likewise.
23:19 - 23:21

So it just made sense to just basically leave
23:21 - 23:23

these data structures alone and let them grow.
That said,
23:23 - 23:26

having those in Redis, neh, it bugged me a
23:26 - 23:30

little. But, again, 1.0 system, it wasn't
a
23:30 - 23:31

big concern.
23:31 - 23:36

So post score calculator is, I think I got
23:36 - 23:37

mixed up there. Post score calculator is where
the
23:37 - 23:40

big big o n squared nastiness came in. So
23:40 - 23:44

now we've rescored all these users' interests.
We need
23:44 - 23:48

to go propagate this throughout the system.
23:48 - 23:49

And so again we have, back to the, excuse
23:49 - 23:54

me I'm so sorry - but I discovered after
23:54 - 23:57

the fact a name for this pattern that I
23:57 - 24:01

came upon. It's called inverted indices. An
inverted index,
24:01 - 24:03

that's a link by the way - these will
24:03 - 24:04

go up on GitHub at some point. This is
24:04 - 24:06

all HTML. You'll be able to click through.
You
24:06 - 24:08

don't have to take any notes, if by chance
24:08 - 24:09

you are.
24:09 - 24:13

An inverted index is basically just an index
of
24:13 - 24:16

the content to where the content is stored.
So
24:16 - 24:20

I had a few different sets I would, for,
24:20 - 24:21

let's see, a post, I would have the set
24:21 - 24:24

of all the tags, and let's see, and I
24:24 - 24:26

actually have to read this cause I don't remember
24:26 - 24:27

off the top of my head.
24:27 - 24:29

And then I had a set of right, the
24:29 - 24:32

interested user IDs by tag. And that would
save
24:32 - 24:33

me from having to go out to the database
24:33 - 24:34

all the time to perform a whole bunch of
24:34 - 24:36

expensive queries. I could just go to Redis
and
24:36 - 24:40

say hey just give me and boom, there I
24:40 - 24:41

go.
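
[Editor's sketch] The two kinds of sets described here, tags per post and interested user IDs per tag, can be illustrated with plain Ruby sets standing in for Redis sets. The key names and sample IDs are assumptions made for the example.

```ruby
require "set"

# In-memory stand-in for Redis sets: key => Set of members.
# Key names are hypothetical.
sets = Hash.new { |h, k| h[k] = Set.new }

# Post 7 is tagged with a team and a player (SADD in Redis terms).
sets["post:7:tags"] << "team:orioles" << "player:jones"

# Inverted index: for each tag, which users are interested in it.
sets["tag:team:orioles:interested_users"].merge([1, 2])
sets["tag:player:jones:interested_users"].merge([2, 3])

# Every user whose score for a post may need recalculating: the union of
# the interested-user sets for each of the post's tags (SUNION in Redis).
def interested_users(sets, post_id)
  sets["post:#{post_id}:tags"].reduce(Set.new) do |users, tag|
    users | sets["tag:#{tag}:interested_users"]
  end
end

affected = interested_users(sets, 7)
```

The lookup never touches the relational database: everything it needs is answered from the sets.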
24:41 - 24:43

And the user post scores were also stored
in
24:43 - 24:46

Redis as a hash, very much like the user
24:46 - 24:48

interest scores. It's just instead of having
the tag
24:48 - 24:51

you had a post.
24:51 - 24:54

So this structure, I, I showed you earlier,
it's
24:54 - 24:56

a workflow, but really what it also could
be
24:56 - 24:58

is a series of queues. It's not a big
24:58 - 25:01

truck. You don't just dump stuff on it. Thank
25:01 - 25:05

you - I'm so glad somebody appreciated that
one.
25:05 - 25:11

So some other design considerations that came
up as
25:11 - 25:15

we went along. So I'm, I've alluded a few
25:15 - 25:18

times. I was trying to aggressively optimize
the RDBMS
25:18 - 25:21

out of the equation. The client very much
did
25:21 - 25:23

not want the recommendation engine to make
the Rails
25:23 - 25:25

app run slower, because, well, they're on,
they, let's
25:25 - 25:28

see, they're on Heroku and, you know, ?? [00:25:27]
25:28 - 25:29

cost money.
25:29 - 25:34

And we already talked about using inverted
indices to
25:34 - 25:37

some effect, again reducing - further reducing
the need
25:37 - 25:41

for database queries. And I already talked
about those
25:41 - 25:42

examples.
25:42 - 25:46

Now the other thing I, that I, I've mentioned,
25:46 - 25:48

that I broke the calculator down into a trendingness
25:48 - 25:51

calculator and a user interest score calculator
and post
25:51 - 25:55

score calculator, and no one's made any kind
of
25:55 - 25:58

rude gestures, but those names really suck.
I'm sorry.
25:58 - 26:00

Post score calculator - just, it's German.
Take a
26:00 - 26:04

whole bunch of words together and mush them
together.
26:04 - 26:06

Is it good enough? Well, it ran in production,
26:06 - 26:09

and the customer was happy. So yay.
26:09 - 26:14

Was I ashamed of a lot of the code
26:14 - 26:18

I wrote? Oh god yes. Would it scale? Well,
26:18 - 26:22

I was limited by Redis, so memory, RAM. And
26:22 - 26:25

I knew I'd be limited by CPU. But what
26:25 - 26:27

they were conc- what they were concerned with
was
26:27 - 26:29

getting a 1.0 release out the door, something
that
26:29 - 26:31

people could use right away, and if they'd
be
26:31 - 26:33

successful, well, worst case they would rewrite
it. But
26:33 - 26:37

there was potential to refactor and scale
it further.
26:37 - 26:40

One of the little interesting things that
happened along
26:40 - 26:43

the way is, because a post, the post is
26:43 - 26:48

polymorphically taggable, you could just throw
anything on it.
26:48 - 26:52

So the engine originally didn't just care
about teams
26:52 - 26:53

and players - it just took any old tag
26:53 - 26:56

you gave it. And the, the client later said
26:56 - 26:57

yeah, I really only want the other teams and
26:57 - 27:01

players, but the interesting side effect was,
well users
27:01 - 27:02

would get thrown in there as tags too.
27:02 - 27:04

So I thought, you know, maybe a side business
27:04 - 27:07

is a sports dating site or something, because
all
27:07 - 27:08

of a sudden it would say, hey, this is
27:08 - 27:10

how interested I am in this person versus
that
27:10 - 27:10

person.
27:10 - 27:14

But, no, I took that out and hard-coded it
27:14 - 27:16

to just teams and players.
27:16 - 27:20

So, so lessons learned along the way.
27:20 - 27:22

Statistical methods - obviously would have
been nice, because
27:22 - 27:24

any time I had to write a big O
27:24 - 27:25

N squared algorithm and I know N's gonna keep
27:25 - 27:28

getting bigger, I get really anxious. I did
not
27:28 - 27:30

like writing this.
27:30 - 27:33

The lesson learned for me, if I were still
27:33 - 27:34

freelancing, but even though I'm not I work
at
27:34 - 27:37

Rackspace, is when I know something is right,
I
27:37 - 27:38

need to argue for it, just a little bit
27:38 - 27:41

more.
27:41 - 27:45

Prefer straight key-value over hashes. I've
mentioned TTLs and
27:45 - 27:48

I think I mentioned TTLs and I mentioned TTLs.
27:48 - 27:51

You can't put a TTL on a field in a
27:51 - 27:54

hash. So you can't say I want this for
27:54 - 27:57

this user, this post score to expire some
time
27:57 - 28:00

in the future. No, you're stuck with that
guy.
28:00 - 28:02

So there is a way around that. That's, I
28:02 - 28:04

think, the next slide. But that's more work.
28:04 - 28:06

What I could have done instead is just had
28:06 - 28:10

longer munged names. Say, like, user ID, user
ID
28:10 - 28:13

blah blah blah, post ID blah blah blah, and
28:13 - 28:15

then just had a value and put a TTL
28:15 - 28:17

on that and then it would just disappear in
28:17 - 28:19

three days and life would have been better.
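
[Editor's sketch] The alternative described here is one flat key per user and post, each with its own TTL. The sketch below models that with a small in-memory store that mimics what SETEX and GET do in Redis; the key format and the `ExpiringStore` class are assumptions for illustration, and the clock is injected so nothing has to sleep.

```ruby
# Hypothetical key naming: one flat key per (user, post) pair.
def post_score_key(user_id, post_id)
  "user:#{user_id}:post:#{post_id}:score"
end

# Tiny in-memory stand-in for a key-value store with per-key expiry.
class ExpiringStore
  def initialize(clock)
    @clock = clock
    @data = {} # key => [value, expires_at]
  end

  # Store a value that disappears ttl_seconds from now.
  def setex(key, ttl_seconds, value)
    @data[key] = [value, @clock.call + ttl_seconds]
  end

  # Return the value, or nil if it was never set or has expired.
  def get(key)
    value, expires_at = @data[key]
    return nil if expires_at.nil? || @clock.call >= expires_at
    value
  end
end

THREE_DAYS = 3 * 24 * 60 * 60

now = 0
store = ExpiringStore.new(-> { now })
store.setex(post_score_key(42, 7), THREE_DAYS, 0.8)

fresh = store.get(post_score_key(42, 7)) # read before the TTL elapses
now += THREE_DAYS
stale = store.get(post_score_key(42, 7)) # read after the TTL elapses
```

With a hash, the whole structure shares one lifetime; with flat keys, each score ages out on its own.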
28:19 - 28:23

Extracting small - OK, so one more slide. Extracting
28:23 - 28:25

smaller workers - when I said this is a
28:25 - 28:27

series of queues, what I was getting at was
28:27 - 28:31

I designed this system expecting that post
score calculator
28:31 - 28:34

would inevitably get more CPU than the other
guys.
28:34 - 28:38

So it, they're all written as Resque workers,
but
28:38 - 28:40

only the outside calculator is a Resque worker.
28:40 - 28:42

It would have been fairly trivial to extract
the
28:42 - 28:44

other three, give them all their each, give
them
28:44 - 28:46

each their own queue. The only thing I would
28:46 - 28:48

have had to have done is I would have
28:48 - 28:50

had to add persistence capability to them,
and that
28:50 - 28:53

could have been something dependency injectable,
for example.
28:53 - 28:56

And that would have been kind of simple.
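
[Editor's sketch] One way to read "persistence as something dependency injectable" is a calculator that receives its store as a constructor argument, so the same class can run inline or as its own worker on its own queue. The class names, the `save` interface, and the scoring rule below are all made up for the example; the talk does not show the real code.

```ruby
# Hypothetical worker: the persistence object is injected, not hard-coded.
class PostScoreCalculator
  def initialize(store)
    @store = store
  end

  # interests: { tag => score }, post_tags: array of tag names.
  # The scoring rule (sum of matching interests) is invented for this sketch.
  def perform(user_id, post_id, interests, post_tags)
    score = post_tags.sum { |tag| interests.fetch(tag, 0.0) }
    @store.save("user:#{user_id}:post:#{post_id}:score", score)
    score
  end
end

# Any object responding to #save works; a production version could wrap
# a Redis client behind the same method.
class MemoryStore
  attr_reader :saved

  def initialize
    @saved = {}
  end

  def save(key, value)
    @saved[key] = value
  end
end

store = MemoryStore.new
calculator = PostScoreCalculator.new(store)
score = calculator.perform(42, 7, { "a" => 1.5, "b" => 1.0 }, %w[a b c])
```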
28:56 - 28:57

The other thing that would have been nice
is
28:57 - 29:00

each one of those little guys basically runs
on
29:00 - 29:02

a case statement, and that's the big giant
oh,
29:02 - 29:05

oh, scream of please extract me, please extract
me
29:05 - 29:08

and, well. Oh well.
29:08 - 29:12

Less chattiness with Redis. So I just was
making
29:12 - 29:14

individual calls to Redis if I needed to get
29:14 - 29:16

a set, I would just make a single call.
29:16 - 29:18

If I needed to do, push a key value
29:18 - 29:20

pair, at least a single call. It was enough.
29:20 - 29:25

It worked well enough. Individual calls on
AWS from
29:25 - 29:26

Heroku to Redis - I did actually bench it.
29:26 - 29:28

It was something like 2 milliseconds.
29:28 - 29:30

They add up, but if you're, only if you're
29:30 - 29:32

making a ton of them. This was still way
29:32 - 29:35

faster than the Rails app, so it, it really
29:35 - 29:37

wasn't a big concern. But something to be
aware
29:37 - 29:37

of.
29:37 - 29:41

Redis supports two different features that,
only one at
29:41 - 29:44

the time when I wrote this recommendation
engine, that
29:44 - 29:46

would have helped here. Pipelining, which
allows you to
29:46 - 29:50

just batch up commands. You get futures back,
which
29:50 - 29:52

is basically saying here, this is where your
result
29:52 - 29:54

will go later. And then all of the results
29:54 - 29:57

come back, and you just access the futures
to
29:57 - 29:59

get the results.
29:59 - 30:01

So you send one big request with all of
30:01 - 30:02

your different commands, and then you get
a bunch
30:02 - 30:07

of little responses back when they're ready.
And that
30:07 - 30:09

will result in less network chattiness, which
means your
30:09 - 30:11

app will run faster.
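
[Editor's sketch] The idea of pipelining, queueing commands, sending them in one round trip, and reading results through futures, can be shown without a Redis server. Everything below (`FakeServer`, `Future`, `Pipeline`) is a hypothetical stand-in written for this example; it is not the redis-rb API, which exposes the real feature through its `pipelined` method.

```ruby
# Stand-in for a server; counts round trips instead of doing network I/O.
class FakeServer
  attr_reader :round_trips

  def initialize
    @data = {}
    @round_trips = 0
  end

  # One round trip carrying any number of commands; returns one reply each.
  def call(commands)
    @round_trips += 1
    commands.map do |name, key, value|
      case name
      when :set then @data[key] = value
      when :get then @data[key]
      end
    end
  end
end

# Placeholder for a result that is filled in when the batch is flushed.
class Future
  attr_accessor :value
end

class Pipeline
  def initialize(server)
    @server = server
    @queued = [] # pairs of [command, future]
  end

  def set(key, value)
    enqueue([:set, key, value])
  end

  def get(key)
    enqueue([:get, key])
  end

  # Send every queued command at once, then resolve the futures in order.
  def flush
    replies = @server.call(@queued.map(&:first))
    @queued.each_with_index { |(_, future), i| future.value = replies[i] }
    @queued.clear
  end

  private

  def enqueue(command)
    future = Future.new
    @queued << [command, future]
    future
  end
end

server = FakeServer.new
pipe = Pipeline.new(server)
pipe.set("a", 1)
pipe.set("b", 2)
a = pipe.get("a")
b = pipe.get("b")
pipe.flush
```

Four commands cost one round trip here; issued individually at roughly 2 milliseconds each, they would cost four.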
30:11 - 30:14

The second one, and this one is very dangerous
30:14 - 30:18

because Redis is an evented key-value store,
I didn't
30:18 - 30:21

mention. Which means there's one thread. You
can script
30:21 - 30:24

inside of Redis. If you script badly inside
of
30:24 - 30:28

Redis you might occupy that one thread for
a while,
30:28 - 30:30

and when you send commands to Redis, it might
30:30 - 30:32

say, sorry I'm busy right now.
30:32 - 30:33

So that would be bad. You know, like crossing
30:33 - 30:36

the streams in Ghostbusters.
30:36 - 30:39

So pruning. When I mentioned not wanting to
use
30:39 - 30:43

hashes so much in Redis, if you are gonna
30:43 - 30:45

use something that's gonna grow and grow and
grow
30:45 - 30:47

and nothing ever expires and you want things
to
30:47 - 30:51

be removed eventually, one option is to put
a,
30:51 - 30:54

a timestamp in every value that you're gonna
put
30:54 - 30:55

in that hash.
30:55 - 30:57

So instead of just putting straight values
into a
30:57 - 31:00

hash, just integers for example, you just
put JSON
31:00 - 31:02

in, like we do with Resque, and that might
31:02 - 31:05

have a timestamp and then the value. And then
31:05 - 31:08

what that means is periodically, although
I've heard that
31:08 - 31:10

the, the best practice is maybe every time
you
31:10 - 31:13

do an insertion into a structure - also go
31:13 - 31:15

through and prune that structure, look for
things that
31:15 - 31:19

you can remove that have outlived their usefulness.
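
[Editor's sketch] The pruning approach described here, storing JSON with a timestamp next to each value and sweeping out old entries, could look like this. A plain Ruby Hash stands in for the Redis hash; the `ts`/`value` field names and the helper names are assumptions.

```ruby
require "json"

# Wrap each value with the time it was written (seconds since some epoch).
def wrap(value, now)
  JSON.generate("ts" => now, "value" => value)
end

# Remove fields whose timestamp is older than max_age seconds.
# Against Redis this would be roughly an HGETALL followed by HDEL calls.
def prune!(hash, now, max_age)
  hash.delete_if do |_field, json|
    now - JSON.parse(json)["ts"] > max_age
  end
end

THREE_DAYS_IN_SECONDS = 3 * 24 * 60 * 60 # 259_200

scores = {
  "post:1" => wrap(0.4, 0),       # written long ago
  "post:2" => wrap(0.9, 250_000)  # written recently
}

prune!(scores, 300_000, THREE_DAYS_IN_SECONDS)
```

Every sweep has to read and parse the whole hash, which is the extra work that a per-key TTL avoids.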
31:19 - 31:22

But I mentioned earlier, still, better to
prefer a
31:22 - 31:25

key-value where you can just set a TTL than
31:25 - 31:26

to go and have to deal with pruning. Pruning
31:26 - 31:28

is more work. I told the client that was
31:28 - 31:31

something I was concerned about for later. Again,
for 1.0
31:31 - 31:33

they didn't care. They were like, I could
wash
31:33 - 31:34

my hands of it and walk away, but it,
31:34 - 31:37

still, I didn't like knowing memory would
just grow.
31:37 - 31:39

I used to code in C and Java and
31:39 - 31:43

you don't like leaking memory.
31:43 - 31:46

So one other thing I realized actually just
today
31:46 - 31:48

was the calculator, because it was stateless,
could have
31:48 - 31:53

benefited a lot from a functional programming
style using
31:53 - 31:56

what's called referential transparency. And
really that's just a
31:56 - 31:58

fancy way of saying the output from one function
31:58 - 32:00

is the input to the next function. And you
32:00 - 32:04

just accumulate state by taking that output
from, from
32:04 - 32:06

one function, passing that as your input to
the
32:06 - 32:08

next, along with whatever stuff you need to,
and
32:08 - 32:10

just keep accumulating and your final output's
what you
32:10 - 32:11

care about.
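
[Editor's sketch] The style described here, each function taking the previous output and returning a new accumulated state, can be sketched with lambdas and `reduce`. The step names echo the calculators from the talk, but the arithmetic and the state keys are invented for the example.

```ruby
# Each step is a pure function: it reads the state it is given and returns
# a new hash, never modifying its input or any shared variable.
add_trendingness = ->(state) { state.merge(trending: state[:mentions] * 2) }
add_interest     = ->(state) { state.merge(interest: state[:trending] + state[:likes]) }
add_post_score   = ->(state) { state.merge(post_score: state[:interest] / 10.0) }

steps = [add_trendingness, add_interest, add_post_score]

# Feed the output of each step into the next; the last output is the result.
def run(steps, initial)
  steps.reduce(initial) { |state, step| step.call(state) }
end

input = { mentions: 3, likes: 4 }.freeze
result = run(steps, input)
```

Because no step mutates anything, each one can be tested on its own with a hand-built input hash.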
32:11 - 32:12

That might have been pretty nice to do. It
32:12 - 32:14

might have made the code a little bit more readable
32:14 - 32:17

because the imperative style can be a bit
hard
32:17 - 32:20

to follow sometimes. I know I was, as I
32:20 - 32:22

said I wasn't thrilled with the end result
of
32:22 - 32:23

the code, and I tried really hard to make
32:23 - 32:25

it readable, but the imperative style didn't
look too
32:25 - 32:28

good in the calculator.
32:28 - 32:30

And the final lesson learned is you do something
32:30 - 32:33

faster than Ruby.
32:33 - 32:35

So, but that, that was really just a joke.
32:35 - 32:38

Because Ruby was actually adequate to the
task. It
32:38 - 32:39

wasn't a problem. So this is the part where
32:39 - 32:41

I get to say I've got seven minutes and
32:41 - 32:45

forty-one seconds. Are there any questions,
heckling, or other
32:45 - 32:48

statements, remarks, something?
32:49 - 32:52

No? OK. Cool, well. Three minutes left, so
thanks
32:52 - 32:53

very much.

Title:: Ruby Conf 2013 - Recommendation Engines with Redis and Ruby by Evan Light
Duration:: 33:20
