
Ruby Conf 2013 - Fault Tolerant Data: Surviving the Zombie Apocalypse

  • 0:17 - 0:21
    CASEY ROSENTHAL: Hi. Hope everybody had a
    good lunch.
  • 0:21 - 0:28
    I'd like to start by reading the title of
    my presentation.
  • 0:29 - 0:33
    Fault Tolerant Data: Surviving the Zombie
    Apocalypse.
  • 0:33 - 0:38
    I'm Casey Rosenthal. That's my Twitter handle.
    If you're
  • 0:38 - 0:39
    interested in some of the stuff I talk about,
  • 0:39 - 0:41
    I figure the most efficient way for me to
  • 0:41 - 0:43
    get links to you is I'll, I'll Tweet the
  • 0:43 - 0:47
    links after the talk.
  • 0:47 - 0:49
    Very small bit of background on me - I
  • 0:49 - 0:52
    worked for a company called Basho. Our icons
    look
  • 0:52 - 0:56
    like this because it's named after a Japanese
    poet.
  • 0:56 - 0:58
    So that's me. I get to work with a
  • 0:58 - 1:02
    really good team of consultants. That's the
    consulting team
  • 1:02 - 1:05
    I've had the opportunity to work with.
  • 1:05 - 1:09
    And we set up distributed databases for critical
    data
  • 1:09 - 1:16
    for large enterprises. So recording things
    like internet traffic,
  • 1:16 - 1:19
    transactions and things that lives depend
    on, like medical
  • 1:19 - 1:24
    records and video game scores. Really important
    things like
  • 1:24 - 1:25
    that.
  • 1:25 - 1:28
    And to give you a sense of what
  • 1:28 - 1:34
    my motivation is - I love working with data.
  • 1:34 - 1:38
    And I'm gonna make a distinction. You can call databases
  • 1:38 - 1:43
    SQL versus NoSQL, NewSQL, and so on,
    there's
  • 1:43 - 1:45
    a bunch of different ways of chopping up the
  • 1:45 - 1:49
    database space right now. But from a software
    engineer's
  • 1:49 - 1:52
    perspective, I'm gonna draw a primary distinction
    between SQL
  • 1:52 - 1:55
    and "other", OK.
  • 1:55 - 2:00
    So I work particularly with a distributed
    key-value database.
  • 2:00 - 2:02
    Most of the data modeling techniques I'm gonna
    talk
  • 2:02 - 2:09
    about are applicable to distributed key-value
    and other kinds
  • 2:09 - 2:12
    of distributed databases. Not just the one
    that I
  • 2:12 - 2:17
    happen to work with most, which is Riak.
  • 2:17 - 2:20
    So the reason that I love this category of
  • 2:20 - 2:24
    "other" is because I'm gonna generalize, I'm
    totally gonna
  • 2:24 - 2:27
    generalize about your experience. Most of
    you have worked
  • 2:27 - 2:30
    with relational databases. I don't really
    care if that's
  • 2:30 - 2:32
    true, I'm just gonna assume it.
  • 2:32 - 2:35
    So I also come from a background with SQL,
  • 2:35 - 2:41
    with relational databases, and over many,
    many years working
  • 2:41 - 2:44
    in other languages, but also Ruby, I got to
  • 2:44 - 2:47
    the point where, OK, I kind of understood
    what
  • 2:47 - 2:50
    an MVC framework looked like. I'm not gonna
    come
  • 2:50 - 2:53
    across any new surprises there.
  • 2:53 - 2:56
    With the decades of institutional knowledge
    we have about
  • 2:56 - 3:00
    SQL databases, nothing is really gonna
    surprise me about
  • 3:00 - 3:02
    how to model
  • 3:02 - 3:05
    data on a relational database.
  • 3:05 - 3:08
    And if you know your design patterns, and
    you
  • 3:08 - 3:11
    probably have a couple that you use regularly,
    it's
  • 3:11 - 3:14
    unlikely that you're gonna come across a design
    pattern
  • 3:14 - 3:16
    that, like, completely blows your mind and
    changes how
  • 3:16 - 3:19
    you do everything.
  • 3:19 - 3:23
    In the category of other databases, that happens
    on
  • 3:23 - 3:27
    a regular basis. New databases come along
    that completely
  • 3:27 - 3:31
    change, not only how you understand modeling
    your data,
  • 3:31 - 3:34
    but the kinds of applications you're able
    to build,
  • 3:34 - 3:37
    because they have different properties, they
    can do new
  • 3:37 - 3:40
    things, they look at data in a different way
  • 3:40 - 3:43
    that you didn't have the capability of accessing
    before.
  • 3:43 - 3:45
    So it's really cutting edge and it changes
    a
  • 3:45 - 3:50
    lot and it's a good brain exercise.
  • 3:50 - 3:52
    So part of that is just the computology aspect.
  • 3:52 - 3:55
    I happen to like reading white papers, and
    there's
  • 3:55 - 3:58
    a lot of great academic work going on right
  • 3:58 - 4:01
    now in the other database and the distributed
    systems
  • 4:01 - 4:03
    sphere.
  • 4:03 - 4:05
    There are a lot of unsolved problems, still.
    There
  • 4:05 - 4:08
    are cases where we don't know the implications
    of
  • 4:08 - 4:12
    setting up certain kinds of distributed architectures.
    And I'll
  • 4:12 - 4:16
    touch on a couple of them.
  • 4:16 - 4:19
    The other reason this field is exciting to
    me
  • 4:19 - 4:25
    is because of this principle of high availability.
  • 4:25 - 4:29
    Yeah.
  • 4:29 - 4:33
    Zing.
  • 4:33 - 4:39
    So think for a moment about Google, right.
    When
  • 4:39 - 4:43
    you go to search something, you don't necessarily
    expect
  • 4:43 - 4:48
    that it has everything indexed up to the exact
  • 4:48 - 4:51
    second that you hit the search button, right.
    We
  • 4:51 - 4:54
    allow a little bit of fuzziness in the results.
  • 4:54 - 4:55
    In fact, we don't even expect that if you
  • 4:55 - 4:59
    search here, you're gonna get the same exact
    results
  • 4:59 - 5:01
    at the same time as if somebody did the
  • 5:01 - 5:04
    same exact search in LA, right.
  • 5:04 - 5:06
    We allow for a little bit of oddness there.
  • 5:06 - 5:09
    What we do expect is that Google will show
  • 5:09 - 5:12
    up. If we get a 500 or a 400,
  • 5:12 - 5:15
    we'll probably start checking our internet
    connection first, before
  • 5:15 - 5:17
    we think, Google is down, right.
  • 5:17 - 5:20
    That's a form of high availability.
    To
  • 5:20 - 5:23
    give you another example - email. If you use
  • 5:23 - 5:26
    email on your phone, and your phone goes out
  • 5:26 - 5:28
    of range, you don't expect that all of
  • 5:28 - 5:29
    a sudden you won't be able to read your email
  • 5:29 - 5:32
    or write email. You expect
  • 5:32 - 5:34
    that when the internet connects again, anything
    you've written
  • 5:34 - 5:37
    gets sent out. You know, and so that's another
  • 5:37 - 5:40
    form of high availability, that the key ingredient
    here
  • 5:40 - 5:43
    is there is not a global state that exists,
  • 5:43 - 5:47
    that is consistent, at any given point in
    time.
  • 5:47 - 5:52
    Of course, health records. You don't expect,
    when you
  • 5:52 - 5:55
    buy insurance or update your records, or go
    to
  • 5:55 - 5:58
    the doctor, that immediately that's gonna
    be reflected on
  • 5:58 - 6:01
    your health record. But you do expect that
    when
  • 6:01 - 6:03
    you go to the health insurance website, it
    will
  • 6:03 - 6:05
    be up. And it will have, like, historical
    data
  • 6:05 - 6:08
    available to you.
  • 6:08 - 6:11
    So that's really interesting, because that's
    the expectation that
  • 6:11 - 6:14
    most of us have, most people familiar with
    the
  • 6:14 - 6:17
    internet have, about using the internet: that
    it's
  • 6:17 - 6:21
    highly available and not strongly consistent.
    It doesn't have
  • 6:21 - 6:23
    one global state at any given point in time.
  • 6:23 - 6:26
    And there's kind of this tension between the
    two.
  • 6:26 - 6:29
    Well if that's the expectation, the interesting
    thing is
  • 6:29 - 6:33
    SQL was designed to be strongly consistent.
  • 6:33 - 6:37
    So the database that we use intrinsically
    is not
  • 6:37 - 6:40
    the one that follows the paradigm that most
    of
  • 6:40 - 6:47
    our expectations using online applications
    follow. Other databases are
  • 6:47 - 6:50
    built with high availability in mind.
  • 6:50 - 6:52
    So they might not be strongly consistent at
    a
  • 6:52 - 6:54
    given point in time. They might be eventually
    consistent.
  • 6:54 - 6:59
    There's allowance for data that hasn't propagated
    to all
  • 6:59 - 7:02
    of the machines in the cluster to get around
  • 7:02 - 7:05
    to it.
  • 7:05 - 7:09
    So to really hammer home this distinction,
    I'm gonna
  • 7:09 - 7:14
    focus on two different mindsets when we're
    building applications.
  • 7:14 - 7:18
    This is your brain on SQL. And nobody's rushing
    the
  • 7:18 - 7:20
    screen, which is a good sign, so there's no
  • 7:20 - 7:24
    zombies in here.
  • 7:24 - 7:26
    When you're building - so, as an engineer,
    you
  • 7:26 - 7:28
    get a use case and you're gonna build this
  • 7:28 - 7:33
    application on SQL. Again, I'm just gonna
    generalize some
  • 7:33 - 7:36
    experience here. So you take the use case,
    you
  • 7:36 - 7:39
    figure out what data's in there - say, addresses,
  • 7:39 - 7:42
    companies, users, whatever. And you break
    those down into
  • 7:42 - 7:46
    tables. Figure out what the relationships
    between those tables
  • 7:46 - 7:49
    are, and normalize your data.
  • 7:49 - 7:50
    Do a lot of other things that, again, we
  • 7:50 - 7:54
    have decades of institutional knowledge, how
    to structure tables
  • 7:54 - 7:58
    and rows and indexes in relational databases.
    And if
  • 7:58 - 8:02
    we do that right, according to best practices,
    if
  • 8:02 - 8:05
    we do that well, then we have a pretty
  • 8:05 - 8:09
    good level of confidence that when we go to
  • 8:09 - 8:12
    build an application on top of it, we'll be
  • 8:12 - 8:14
    able to get the data out in a way
  • 8:14 - 8:16
    that we want to present it to the client,
  • 8:16 - 8:19
    whether the client is another computer or
    person or
  • 8:19 - 8:20
    whatever.
  • 8:20 - 8:23
    OK. Model your data, do that well, then show
  • 8:23 - 8:26
    it to the client.
  • 8:26 - 8:31
    By contrast, when you're building an
    application on
  • 8:31 - 8:34
    top of a key-value system, you have to have
  • 8:34 - 8:37
    a different mindset. And that mindset is,
    is kind
  • 8:37 - 8:39
    of the reverse. You look at the use case
  • 8:39 - 8:41
    and you focus on how is this gonna look
  • 8:41 - 8:42
    to the client?
  • 8:42 - 8:45
    Again, whether the client is another machine,
    or a
  • 8:45 - 8:48
    user or a terminal or whatever, how, how is
  • 8:48 - 8:51
    it gonna be presented at the end? If you
  • 8:51 - 8:54
    can figure that out, then modeling the data
    kind
  • 8:54 - 8:56
    of falls out, OK.
  • 8:56 - 8:58
    Figure out how it's gonna be presented, and
    then
  • 8:58 - 9:00
    with a high degree of confidence you know
    you'll
  • 9:00 - 9:03
    be able to model the data in a K/V.
  • 9:03 - 9:06
    It's not that one is better than the other
  • 9:06 - 9:10
    in terms of data modeling. The difference
    is that
  • 9:10 - 9:17
    SQL is more flexible, but harder to scale.
  • 9:19 - 9:23
    And that's not a principle, that's an observation.
    So
  • 9:23 - 9:27
    I'm not saying that in principle SQL is harder
  • 9:27 - 9:30
    to scale. But I will make the observation,
    that
  • 9:30 - 9:33
    the more sophisticated the query planner is
    in your
  • 9:33 - 9:37
    database, the more difficult it is to scale
    that
  • 9:37 - 9:40
    database in a way that's highly available
    or fault
  • 9:40 - 9:43
    tolerant in particular.
  • 9:43 - 9:46
    OK. K/V, that's the simplest kind of database
    you
  • 9:46 - 9:48
    could have, right. You give a database a key
  • 9:48 - 9:50
    and a value, it'll store the value. You give
  • 9:50 - 9:53
    it just the key, it gives back the value.
  • 9:53 - 9:56
    There's really no query planning going on
    there, so
  • 9:56 - 9:59
    the design of the database can do other interesting
  • 9:59 - 10:01
    things, like focus on making it highly available
    and
  • 10:01 - 10:05
    fault tolerant and scale horizontally.
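
    To make that contract concrete, here's a minimal Ruby sketch, using a
    plain Hash as a stand-in for the database (the key and value are just
    made-up examples):

        store = {}                               # stand-in for the K/V database
        store["patient0"] = '{"name":"Ash"}'     # write: a key and an opaque value
        store["patient0"]                        # read: key in, value out
                                                 # => '{"name":"Ash"}'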
  • 10:05 - 10:07
    So I want to put this in a perspective
  • 10:07 - 10:11
    that we can all relate to. So the situation
  • 10:11 - 10:14
    we've all had, well maybe you don't all take
  • 10:14 - 10:17
    the subway, but you know I get up, go
  • 10:17 - 10:20
    to work, get on the subway, and look over
  • 10:20 - 10:23
    and there's somebody else who's obviously
    going to work,
  • 10:23 - 10:29
    but they look a little not right.
  • 10:29 - 10:32
    We've all had this experience. We see that
    there's
  • 10:32 - 10:36
    a zombie on the subway and we know that,
  • 10:36 - 10:39
    you know, the apocalypse is upon us. So it's
  • 10:39 - 10:44
    a common theme in our careers. Zombies!
  • 10:44 - 10:49
    Right. So we've all had this experience, and
    you
  • 10:49 - 10:53
    know, maybe the acute zombielepsy breaks out,
    as it
  • 10:53 - 10:55
    does, and you know, maybe the zombies start
    here
  • 10:55 - 10:58
    in Miami, and you got to admit, some of
  • 10:58 - 11:03
    those zombies can run fast. So they spread
    and
  • 11:03 - 11:07
    pretty soon there's, let's say, a million
    zombies around
  • 11:07 - 11:08
    the country.
  • 11:08 - 11:13
    So, again, as frequently happens to me, and
    I'm
  • 11:13 - 11:16
    sure to all of you, the CDC comes and
  • 11:16 - 11:19
    says, hey, we need to get a handle on
  • 11:19 - 11:21
    this. We need to store all of this zombie
  • 11:21 - 11:25
    data. We need to do it quickly, so, you
  • 11:25 - 11:29
    know, we want an application built in Ruby,
    because
  • 11:29 - 11:32
    developer time is of the essence. We want
    this
  • 11:32 - 11:34
    to be agile. We don't know exactly what kind
  • 11:34 - 11:36
    of things we're gonna have to do with the
  • 11:36 - 11:39
    data.
  • 11:39 - 11:41
    And we have this database that we need to
  • 11:41 - 11:46
    store it in. OK. I'm sure anybody can do
  • 11:46 - 11:47
    this. This isn't too sophisticated.
  • 11:47 - 11:49
    What does the data look like? Well, here's
    an
  • 11:49 - 11:54
    example of a value that's just in JSON. So
  • 11:54 - 11:58
    we've got a DNA sample of the zombie.
  • 11:58 - 12:03
    Their gender, their name, address, city, zip.
    Weight, height,
  • 12:03 - 12:05
    latitude, longitude. I left some out - blood
    type,
  • 12:05 - 12:08
    phone number, social security number, stuff
    like that.
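
    As a rough sketch, that value might look something like this in Ruby
    (the field names and figures here are assumptions, not the actual CDC
    schema):

        require "json"

        zombie = {
          dna:       "GATTACA...",     # the DNA sample, truncated
          gender:    "f",
          name:      "Jane Doe",
          address:   "123 Ocean Dr",
          city:      "Boynton Beach",
          zip:       "33436",
          weight:    135,              # pounds
          height:    64,               # inches
          latitude:  26.53,
          longitude: -80.09
        }.to_json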
  • 12:08 - 12:12
    But the situation's a little bit more complicated.
    The
  • 12:12 - 12:16
    CDC actually has databases all around the
    country. And
  • 12:16 - 12:19
    they all need to store this data. And they're
  • 12:19 - 12:22
    kind of hooked up in a weird way because,
  • 12:22 - 12:25
    intertubes, mhmm.
  • 12:25 - 12:28
    So the data has to replicate between the data
  • 12:28 - 12:33
    centers like this, OK. This is, this is not
  • 12:33 - 12:39
    an uncommon situation with big data.
  • 12:39 - 12:41
    This can be done with the SQL database, but
  • 12:41 - 12:44
    it would kind of be a pain in the
  • 12:44 - 12:47
    ass so far, right, to set up the load
  • 12:47 - 12:49
    balancers and figure out the replication strategies.
  • 12:49 - 12:53
    Anyway, let's make it a little bit more interesting.
  • 12:53 - 12:56
    So what else could happen? Well, the CDC knows
  • 12:56 - 12:59
    that this could happen, right. The CDC is
    not
  • 12:59 - 13:03
    yet immune to the acute zombielepsy. So there
    could
  • 13:03 - 13:07
    be a scenario in which we lose data centers.
  • 13:07 - 13:09
    I know what you're thinking - this never happens,
  • 13:09 - 13:16
    right. The whole East coast never goes down.
  • 13:16 - 13:18
    So if that happens, then we lose those connections.
  • 13:18 - 13:22
    But, you know, if the human race is to
  • 13:22 - 13:24
    survive, we can't just ignore these guys out
    here.
  • 13:24 - 13:28
    They have to be able to continue to accept
  • 13:28 - 13:30
    reads and writes.
  • 13:30 - 13:33
    And this is one of the stricter definitions
    of
  • 13:33 - 13:36
    high availability, which is that if you can
    connect
  • 13:36 - 13:38
    to any part of the system, any part of
  • 13:38 - 13:44
    the database, it will accept reads and writes.
    OK,
  • 13:44 - 13:47
    it'll serve you whatever data it has access
    to,
  • 13:47 - 13:50
    and if you write data, it
  • 13:50 - 13:54
    will take it.
  • 13:54 - 13:56
    That's very difficult to do with the SQL database.
  • 13:56 - 13:58
    SQL databases just aren't designed to do that
    sort
  • 13:58 - 13:58
    of thing.
  • 13:58 - 14:01
    But a lot of the other databases are.
  • 14:01 - 14:07
    OK, so this is, this is getting kind of
  • 14:07 - 14:10
    interesting. So let's take another, a closer
    look at
  • 14:10 - 14:17
    that database. Well, it's big.
  • 14:19 - 14:23
    We're storing DNA, right, so that's about,
    your genome,
  • 14:23 - 14:25
    well, I don't know about yours. Everybody
    else's genome
  • 14:25 - 14:32
    is about 1.5 gigabytes per person. So 1.5
    gigs each, times a million zombies,
  • 14:32 - 14:33
    that's getting to be a large database. It
    won't
  • 14:33 - 14:35
    fit on one server.
  • 14:35 - 14:37
    So we're gonna have to store it on several
  • 14:37 - 14:39
    servers.
  • 14:39 - 14:43
    Again, other databases are designed to do
    this. They
  • 14:43 - 14:47
    will automatically do one of two things. They'll
    either
  • 14:47 - 14:51
    have a logical router that knows by looking
    at
  • 14:51 - 14:54
    the key, where the data's supposed to be stored,
  • 14:54 - 14:56
    or they have some sort of meta data server
  • 14:56 - 14:58
    that keeps track of where all of the objects
  • 14:58 - 15:00
    in the database are stored. Those are the
    two
  • 15:00 - 15:05
    major paradigms for how a distributed database
    stores data.
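
    A crude Ruby sketch of the first paradigm, the logical router: every
    node can compute which servers own a key just by hashing it, with no
    metadata server involved. (Riak, for instance, hashes keys onto a fixed
    ring of partitions; this sketch just hashes onto a list of nodes.)

        require "digest"

        NODES = %w[node-a node-b node-c node-d node-e]

        # Hash the key to a starting spot, then take the next few
        # nodes in order as the replicas for that key.
        def preference_list(key, replicas = 3)
          start = Digest::SHA1.hexdigest(key).to_i(16) % NODES.size
          (0...replicas).map { |i| NODES[(start + i) % NODES.size] }
        end

        preference_list("patient0")   # => the three nodes storing patient0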
  • 15:05 - 15:09
    And this also has to be fault tolerant.
  • 15:09 - 15:13
    Let me just put a definition on that phrase,
  • 15:13 - 15:17
    fault tolerant. That's the optimistic view
    that stuff is
  • 15:17 - 15:21
    gonna happen, that bad stuff is gonna happen.
    Optimistically,
  • 15:21 - 15:24
    if you have a fault tolerant system, you're
    expecting
  • 15:24 - 15:28
    bad things to happen. So in this case, server's
  • 15:28 - 15:29
    gonna die.
  • 15:29 - 15:32
    It's gonna catch on fire. The bearings in the
  • 15:32 - 15:34
    hard drive are gonna die, the RAID is gonna
    die
  • 15:34 - 15:36
    - something, something's gonna happen and
    a server's gonna
  • 15:36 - 15:37
    die.
  • 15:37 - 15:40
    And in a fault tolerant system, that's OK.
    It
  • 15:40 - 15:46
    continues to run. In a worse case, a cable
  • 15:46 - 15:49
    comes unplugged. You know, zombies get into
    the data
  • 15:49 - 15:51
    center, chasing the ops guys and trip over
    a
  • 15:51 - 15:53
    cable, right. And so now we've got part of
  • 15:53 - 15:55
    the cluster is connected and another part
    of the
  • 15:55 - 15:58
    cluster is not connected, OK.
  • 15:58 - 16:02
    Again, I'll leave time for questions - we
    can
  • 16:02 - 16:05
    talk more about how databases have different
    strategies of
  • 16:05 - 16:09
    dealing with fault tolerance and replication
    and anti-entropy -
  • 16:09 - 16:16
    entropy being when data sets get out of sync.
  • 16:17 - 16:20
    But let's continue talking about the requirements.
    OK.
  • 16:20 - 16:24
    So we're gonna store this as key/value data
    model,
  • 16:24 - 16:28
    thank you, mister CDC person. So this is
  • 16:28 - 16:31
    the value - we need a key.
  • 16:31 - 16:35
    Well, fortunately they have a system for establishing
    a
  • 16:35 - 16:39
    UUID for a key, so in this case it'll
  • 16:39 - 16:42
    be patient0. When we want to look up this
  • 16:42 - 16:46
    data, we give the system patient0, and it
    gives
  • 16:46 - 16:48
    us back the data that we need to, to
  • 16:48 - 16:50
    do research on.
  • 16:50 - 16:53
    And CDC person says, oh, and also I want
  • 16:53 - 16:55
    to sometimes look it up by zip code. I
  • 16:55 - 16:58
    want all of the zombies in a given zip
  • 16:58 - 16:59
    code.
  • 16:59 - 17:04
    OK. A strict key-value store doesn't have a second
  • 17:04 - 17:06
    way of looking things up. So here's where
    we
  • 17:06 - 17:11
    have to start considering data modeling.
    We can
  • 17:11 - 17:14
    always look this record up by the key patient0,
  • 17:14 - 17:16
    that makes sense. How do we look it up
  • 17:16 - 17:19
    by 33436, that zip code?
  • 17:19 - 17:21
    Well, if we've got the data on a bunch
  • 17:21 - 17:26
    of servers, right, the zombie data comes in,
    we
  • 17:26 - 17:27
    store it on one machine - in this case,
  • 17:27 - 17:31
    the upper right machine. And the system knows
    that
  • 17:31 - 17:36
    patient0 key points to that machine. So that's
    how
  • 17:36 - 17:41
    it knows to get that data.
  • 17:41 - 17:45
    But say we have these three zombies in that
  • 17:45 - 17:47
    zip code. We don't have a way of asking
  • 17:47 - 17:52
    the key-value system to find those three zombies
  • 17:52 - 17:54
    who are in the zip code.
  • 17:54 - 17:58
    The solution is to create an inverted index.
    This
  • 17:58 - 18:02
    is an inverted index because a field from
    the
  • 18:02 - 18:06
    data in the value of the object is pointing
  • 18:06 - 18:10
    back to the objects that contain it. So that's
  • 18:10 - 18:12
    the inversion.
  • 18:12 - 18:14
    And the index is really simple, in this case,
  • 18:14 - 18:16
    we'll just say, hey, an object with the key
  • 18:16 - 18:21
    zip underscore 33436 contains the objects
    patient0, patient45, and
  • 18:21 - 18:25
    patient3924.
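
    In Ruby terms, building that index is no fancier than this (a sketch,
    with the data inlined):

        zombies = {
          "patient0"    => { "zip" => "33436" },
          "patient45"   => { "zip" => "33436" },
          "patient3924" => { "zip" => "33436" },
        }

        # Invert: point from a field value back to the keys that contain it.
        index = Hash.new { |h, k| h[k] = [] }
        zombies.each { |key, value| index["zip_#{value['zip']}"] << key }

        index["zip_33436"]   # => ["patient0", "patient45", "patient3924"]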
  • 18:25 - 18:29
    We're getting more zombies here.
  • 18:29 - 18:31
    OK.
  • 18:31 - 18:36
    So how do we store that index?
  • 18:36 - 18:39
    We know that this zip object should point
  • 18:39 - 18:44
    to those three zombie values. So if we represent
  • 18:44 - 18:47
    that index as this green star thing, where
    do
  • 18:47 - 18:51
    we put it? We have two options.
  • 18:51 - 18:54
    One is, when the zombie object comes into
    the
  • 18:54 - 18:58
    system, we save the index on the same machine
  • 18:58 - 19:02
    with that document. This is called document
    based inverted
  • 19:02 - 19:05
    indexing, because we're partitioning the index
    with the document,
  • 19:05 - 19:08
    the object that we're indexing.
  • 19:08 - 19:13
    So as time goes on, we now have an
  • 19:13 - 19:19
    index for zip 33436, on the upper right
  • 19:19 - 19:21
    machine, and on this lower left machine, because
    those
  • 19:21 - 19:24
    all have zombies in that zip code.
  • 19:24 - 19:27
    OK, let's think about the performance implications
    here. We
  • 19:27 - 19:33
    save an object, the system locally indexes
    it, and
  • 19:33 - 19:39
    saves that index locally. Super efficient
    for a write,
  • 19:39 - 19:40
    right.
  • 19:40 - 19:43
    Save the object, index it yourself. OK, that's
    easily
  • 19:43 - 19:45
    done. Now how do we read it?
  • 19:45 - 19:47
    Well we have to go to each of the
  • 19:47 - 19:50
    machines and say, you know, top one - do
  • 19:50 - 19:52
    you have anybody in that zip? Nope. Second
    one
  • 19:52 - 19:54
    - yup, I've got these two guys. OK.
  • 19:54 - 19:58
    Nope. Yup, one. Nope. And then we put together
  • 19:58 - 20:00
    the results in one answer. That was a really
  • 20:00 - 20:05
    inefficient read, right. But the write
    was efficient,
  • 20:05 - 20:08
    right. So that's one way to do it.
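
    A sketch of that scatter-gather read, with each node's local index as a
    plain Hash:

        Node = Struct.new(:local_index)

        nodes = [
          Node.new({}),                                            # nope
          Node.new({ "zip_33436" => ["patient0", "patient45"] }),  # yup, two
          Node.new({ "zip_33436" => ["patient3924"] }),            # yup, one
        ]

        # The query has to ask every node, then merge the answers.
        def zombies_in_zip(nodes, zip)
          nodes.flat_map { |n| n.local_index.fetch("zip_#{zip}", []) }
        end

        zombies_in_zip(nodes, "33436")
        # => ["patient0", "patient45", "patient3924"]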
  • 20:08 - 20:10
    Another way to do it is a zombie comes
  • 20:10 - 20:14
    in, we index them before we save them, and
  • 20:14 - 20:17
    then we save the index someplace else, a specific
  • 20:17 - 20:20
    place in the cluster. OK, this is called term
  • 20:20 - 20:23
    based inverted index. So it's a different
    partitioning scheme
  • 20:23 - 20:25
    for the index. We're partitioning by the term
    that
  • 20:25 - 20:28
    we're searching on.
  • 20:28 - 20:31
    More zombies come in. And it's always updating
    that
  • 20:31 - 20:34
    same object, which is someplace else in the
    cluster.
  • 20:34 - 20:37
    So let's think about the performance profile
    of this.
  • 20:37 - 20:40
    For every zombie value that we write to the
  • 20:40 - 20:43
    database, we have to fetch the index from
    some
  • 20:43 - 20:45
    place else, add to it,
  • 20:45 - 20:49
    and save it back. So that's an inefficient
    write
  • 20:49 - 20:51
    pattern, but now look at the read. When we
  • 20:51 - 20:53
    want to know who's in that zip, we only
  • 20:53 - 20:59
    have to read from one machine.
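
    A sketch of that write path. The Hash below stands in for the one spot
    in the cluster that owns the index; in a real cluster each fetch and
    store is a network hop:

        index_store = {}   # the single owner of this index, somewhere remote

        def index_zombie(store, zip, patient_key)
          entry = store.fetch("zip_#{zip}", [])   # hop 1: fetch the index
          entry |= [patient_key]                  # add the key, keep it unique
          store["zip_#{zip}"] = entry             # hop 2: write it back
        end

        index_zombie(index_store, "33436", "patient0")
        index_store["zip_33436"]   # a read now touches exactly one machine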
  • 20:59 - 21:02
    So these are the two. We got document-based
    inverted
  • 21:02 - 21:06
    index and term-based inverted index. These
    are the two
  • 21:06 - 21:09
    paradigms for inverted indexes that we've
    considered.
  • 21:09 - 21:14
    Again, document-based: fast write,
    inefficient read.
  • 21:14 - 21:16
    And term-based inverted index has a fast read
    but
  • 21:16 - 21:18
    an inefficient write.
  • 21:18 - 21:21
    The point being that, when we look at the
  • 21:21 - 21:26
    use case, that's what should determine how
    we model
  • 21:26 - 21:28
    the data.
  • 21:28 - 21:32
    We have to understand from the CDC person,
    is
  • 21:32 - 21:35
    this data gonna be written a lot, this index?
  • 21:35 - 21:36
    Or is it gonna be read a lot?
  • 21:36 - 21:41
    OK. It's a much different way of looking at
  • 21:41 - 21:44
    the problem than using a relational database,
    where we
  • 21:44 - 21:49
    just would have indexed that as a separate
    thing,
  • 21:49 - 21:52
    as a secondary index in a relational database.
  • 21:52 - 21:56
    So it's an important distinction
    in this kind
  • 21:56 - 21:59
    of thing, and if you like charts, I'll link
  • 21:59 - 22:03
    to some charts later that show some comparisons
    that
  • 22:03 - 22:07
    we did between different distributed databases
    and
  • 22:07 - 22:13
    different partitioning schemes in
    a distributed database.
  • 22:13 - 22:16
    So back to zombies.
  • 22:16 - 22:18
    Because these guys haven't stopped eating
    brains yet.
  • 22:18 - 22:22
    All right, so in this scenario, where we've
    got
  • 22:22 - 22:26
    two data centers down, three are still up,
    two
  • 22:26 - 22:33
    are connected. Consider the situation where
    data comes in
  • 22:33 - 22:40
    to the top data center, writing to record
    patient0.
  • 22:40 - 22:42
    And somebody down in Southern California also
    writes data
  • 22:42 - 22:49
    to patient0.
  • 22:49 - 22:53
    How would you handle this in a SQL database?
  • 22:53 - 22:57
    Let's make the problem worse. Then because,
    you know,
  • 22:57 - 23:00
    the crisis is happening, somebody runs a long
    cable between
  • 23:00 - 23:02
    those two data centers, and connects them
    and so
  • 23:02 - 23:04
    now they have to replicate their data with
    each
  • 23:04 - 23:09
    other.
  • 23:09 - 23:11
    How do we reconcile these two different versions
    of
  • 23:11 - 23:13
    patient0?
  • 23:13 - 23:17
    A first attempt might be, OK, let's take
    the
  • 23:17 - 23:21
    one with the latest timestamp on it. Two
  • 23:21 - 23:24
    problems with that. One, obviously, you might
    lose data.
  • 23:24 - 23:28
    Two, timestamps in a distributed system are
    entirely
  • 23:28 - 23:31
    unreliable, OK.
  • 23:31 - 23:33
    If you want to sync your clocks in a
  • 23:33 - 23:36
    distributed system, that's a particular kind
    of pain that
  • 23:36 - 23:40
    I just wouldn't want to get into.
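
    A sketch of why last-write-wins is risky. Each datacenter stamps its
    write with its own clock, and the merge silently throws one write away
    (the timestamps here are invented):

        a = { value: { "status" => "bitten" }, ts: 1385146810 }
        b = { value: { "status" => "cured"  }, ts: 1385146809 }  # skewed clock?

        winner = [a, b].max_by { |v| v[:ts] }
        winner[:value]   # => {"status"=>"bitten"} - the "cured" write is gone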
  • 23:40 - 23:46
    So in the simplest case, we have a system
  • 23:46 - 23:49
    that has what are called siblings for that
    particular
  • 23:49 - 23:56
    key/value object. And basically the system
    would get two
  • 23:56 - 23:58
    versions of data and just say, oh, I don't
  • 23:58 - 24:00
    know which is which, so I'm just gonna store
  • 24:00 - 24:02
    both. And then when you go to read it,
  • 24:02 - 24:04
    it says, well I've got this value and this
  • 24:04 - 24:09
    value. These siblings. Here.
  • 24:09 - 24:11
    And at the application level, we can do a
  • 24:11 - 24:13
    couple things with that, right. In the simplest
    case,
  • 24:13 - 24:16
    none of the data overlaps. So we can just
  • 24:16 - 24:18
    combine them, right.
  • 24:18 - 24:22
    None of that data overlaps. None of it overwrote
  • 24:22 - 24:23
    it, so we just combined them.
  • 24:23 - 24:25
    What if that's not the case? Well then we
  • 24:25 - 24:26
    can do a couple things. We can write our
  • 24:26 - 24:30
    own policy. We could just present both versions
    to
  • 24:30 - 24:32
    CDC person on screen, and say I don't know
  • 24:32 - 24:35
    - you pick, right.
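
    A sketch of the simplest resolution policy: if the siblings' fields
    don't overlap, merge them; if they conflict, hand both back and let the
    CDC person pick (both siblings here are invented):

        sibling_a = { "name" => "Jane Doe", "zip" => "33436" }
        sibling_b = { "name" => "Jane Doe", "blood_type" => "O-" }

        def resolve(a, b)
          conflicts = (a.keys & b.keys).reject { |k| a[k] == b[k] }
          return a.merge(b) if conflicts.empty?  # nothing overwrites anything
          [a, b]                                 # conflict: let a human decide
        end

        resolve(sibling_a, sibling_b)
        # => {"name"=>"Jane Doe", "zip"=>"33436", "blood_type"=>"O-"}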
  • 24:35 - 24:37
    This is a problem that I didn't have to
  • 24:37 - 24:40
    think about as an engineer with the SQL database,
  • 24:40 - 24:41
    because it was impossible to have siblings
    in a
  • 24:41 - 24:44
    SQL database.
  • 24:44 - 24:50
    But with this highly available, hugely fault
    tolerant database,
  • 24:50 - 24:51
    this kind of stuff has to be considered.
  • 24:51 - 24:56
    There's a whole field of research into what
    are
  • 24:56 - 25:00
    called CRDTs, commutative or convergent replicated
    data types, that
  • 25:00 - 25:06
    specifically specifies
    the kinds of data
  • 25:06 - 25:12
    structures that you can build that you can
    automatically
  • 25:12 - 25:17
    converge into one value without human intervention,
    right, without
  • 25:17 - 25:21
    any side effects or conflicts.
  • 25:21 - 25:24
    And that's an on-going field of study. We
    don't,
  • 25:24 - 25:26
    we haven't, like, enumerated all of the possible
    data
  • 25:26 - 25:29
    types that we can do that for. I'll give
  • 25:29 - 25:34
    you an example of one simple one.
  • 25:34 - 25:40
    Think of an array that has unique values and
  • 25:40 - 25:44
    only grows. It only gets bigger. You never
    remove
  • 25:44 - 25:46
    a value from it. This is kind of the
  • 25:46 - 25:50
    simplest case for CRDT. So say up in the
  • 25:50 - 25:55
    north west, we had somebody write this object
    with
  • 25:55 - 26:00
    patient0, 45, 3924, and in the southwest,
    we had
  • 26:00 - 26:03
    somebody write this object, the zip index,
    with
  • 26:03 - 26:09
    patient73 and 9217.
  • 26:09 - 26:11
    If we want to combine these two, since we
  • 26:11 - 26:14
    know none of those objects can ever be taken
  • 26:14 - 26:17
    out, we can simply add them all together into
  • 26:17 - 26:21
    the list and call unique on it, right. Problem
  • 26:21 - 26:23
    solved.
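
    That grow-only set is about one line of Ruby. Union is insensitive to
    order and safe to repeat, which is exactly what lets replicas converge
    on their own:

        northwest = ["patient0", "patient45", "patient3924"]
        southwest = ["patient73", "patient9217"]

        merged = (northwest + southwest).uniq
        # => ["patient0", "patient45", "patient3924", "patient73", "patient9217"]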
  • 26:23 - 26:26
    What if you were gonna allow items to be
  • 26:26 - 26:31
    removed? The problem gets a lot more difficult.
  • 26:31 - 26:34
    And that's a whole other area of research
    that,
  • 26:34 - 26:36
    again, is very interesting. Other topics - I
    want
  • 26:36 - 26:39
    to take some questions, so I won't get into
  • 26:39 - 26:42
    this too much, but. Other research topics
    with distributed
  • 26:42 - 26:49
    systems: GeoHashes, right. If you want to
    store longitude
  • 26:51 - 26:54
    and latitude for an object, and then you want
  • 26:54 - 26:56
    to know, hey, find me all of the things
  • 26:56 - 27:01
    that are within a mile, if that's on one
  • 27:01 - 27:05
    computer, we have algorithms that do that.
  • 27:05 - 27:07
    What if you're trying to contain that data
    set
  • 27:07 - 27:10
    on many computers? That becomes a much more
    difficult
  • 27:10 - 27:13
    problem, because you can't just ask one computer,
    give
  • 27:13 - 27:16
    me all the things within a mile, because some
  • 27:16 - 27:18
    of those things might be on another computer.
    So
  • 27:18 - 27:21
    what do you do, ask all of the computers?
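
    The idea behind GeoHashes, sketched below: bisect longitude and
    latitude alternately and record the bits, so nearby points usually
    share a long prefix. Partition the index by prefix and one machine can
    own a whole cell. (An illustration only - real GeoHashes base32-encode
    the bits, and points near a cell boundary need neighbor cells checked.)

        def geohash_bits(lat, lon, bits = 30)
          lat_lo, lat_hi = -90.0, 90.0
          lon_lo, lon_hi = -180.0, 180.0
          (0...bits).map { |i|
            if i.even?   # even bits split longitude, odd bits split latitude
              mid = (lon_lo + lon_hi) / 2
              lon >= mid ? (lon_lo = mid; "1") : (lon_hi = mid; "0")
            else
              mid = (lat_lo + lat_hi) / 2
              lat >= mid ? (lat_lo = mid; "1") : (lat_hi = mid; "0")
            end
          }.join
        end

        miami   = geohash_bits(25.76, -80.19)
        boynton = geohash_bits(26.53, -80.09)
        miami[0, 12] == boynton[0, 12]   # => true: neighbors share a prefix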
  • 27:21 - 27:24
    That's an inefficient read. A lot of interesting
    research
  • 27:24 - 27:28
    going on there, and ACiD transactions. The
    'i' is
  • 27:28 - 27:31
    lowercase on purpose.
  • 27:31 - 27:35
    So in a highly available system, you know
    a
  • 27:35 - 27:41
    couple years ago, I was asked: can
  • 27:41 - 27:44
    you have ACID-compliant transactions on top
    of a
  • 27:44 - 27:47
    highly available system, and the general consensus
    was no
  • 27:47 - 27:51
    - as little as two years ago.
  • 27:51 - 27:54
    Now, we know that that's not the case. It
  • 27:54 - 27:57
    turns out that the 'i' in ACiD actually means
  • 27:57 - 28:01
    a lot of things. And in a SQL database,
  • 28:01 - 28:06
    you probably think it means that one transaction
    starts,
  • 28:06 - 28:10
    does stuff, stops, and then another one starts,
    does
  • 28:10 - 28:13
    stuff, stops.
  • 28:13 - 28:16
    Most SQL databases that we use here - again,
  • 28:16 - 28:20
    I'm just making a huge generalization - don't
    actually
  • 28:20 - 28:25
    work that way. Or they rely on serialization
    at
  • 28:25 - 28:27
    the resource level to give you that kind of
  • 28:27 - 28:31
    result. But they have other levels of protection
    called,
  • 28:31 - 28:35
    you know, repeatable reads, or read committed,
    where, when
  • 28:35 - 28:40
    one transaction starts, maybe another one
    starts, too, and
  • 28:40 - 28:43
    does some work before the first one ends.
  • 28:43 - 28:46
    OK, by default, a lot of the databases, the
  • 28:46 - 28:50
    SQL databases that we use do that. That kind
  • 28:50 - 28:53
    of transaction, you can build on top of a
  • 28:53 - 28:57
    highly available system. And you can model
    your data
  • 28:57 - 28:59
    in a way to give you those properties if
  • 28:59 - 29:05
    you require them.
  • 29:05 - 29:11
    So to sum up, keep calm. Always bring a
  • 29:11 - 29:12
    towel. The fate of the human race depends
    on
  • 29:12 - 29:19
    you. Distributed data, distributed databases
    like this, highly available
  • 29:20 - 29:24
    databases, can help you survive the next zombie
    apocalypse.
  • 29:24 - 29:26
    We've all been through them before. There's
    no reason
  • 29:26 - 29:29
    for us to expect we won't go through more
  • 29:29 - 29:31
    in the future.
  • 29:31 - 29:36
    And, I will, again, Tweet links to this, or
  • 29:36 - 29:37
    you can just copy it - it's zombies dot
  • 29:37 - 29:40
    samples dot basho dot com.
  • 29:40 - 29:44
    We have an application that uses the data
  • 29:44 - 29:47
    modeling that I talked about online. It has
    a
  • 29:47 - 29:50
    map where you can see where the zombies in
  • 29:50 - 29:54
    the area are, and it allows you to search
  • 29:54 - 29:57
    using those two different indexing methods
    that I mentioned:
  • 29:57 - 30:02
    term based inverted indexing, and document
    based inverted indexing
  • 30:02 - 30:05
    for the zombies in a given zip code.
  • 30:05 - 30:10
    You can also search for them by GeoHash. There
  • 30:10 - 30:16
    is a blog post describing how that's done.
    And
  • 30:16 - 30:19
    the code is available in Ruby, and a little
  • 30:19 - 30:24
    bit of JavaScript. OK.
  • 30:24 - 30:27
    And that is my presentation. So I would like
  • 30:27 - 30:30
    to thank four of my colleagues - Drew Kerrigan,
  • 30:30 - 30:34
    Dan Kerrigan, for writing the application
    that I spoke
  • 30:34 - 30:36
    about. Nathan Aschbacher for some of the material
    and
  • 30:36 - 30:41
    John Newman for the artwork. And I will include
  • 30:41 - 30:45
    links to all the things that I just referenced
  • 30:45 - 30:49
    there in Twitter, shortly.
  • 30:49 - 30:53
    So my time is up. Find me in the
  • 30:53 - 30:54
    halls. And good luck.