How Uber Uses your Phone as a Backup Datacenter

  • 0:07 - 0:11
    (audience claps)
    (Presenter) Ok.
  • 0:11 - 0:15
    Alright everybody so, let's dive in.
  • 0:15 - 0:18
    So let's talk about how Uber trips
    even happen
  • 0:18 - 0:20
    before we get into the nitty gritty
    of how
  • 0:20 - 0:23
    we save them from the clutches
    of a datacenter failover.
  • 0:23 - 0:27
    So you might have heard we're all about
    connecting riders and drivers.
  • 0:27 - 0:28
    This is what it looks like.
  • 0:28 - 0:30
    You've probably at least
    seen the rider app.
  • 0:30 - 0:32
    You get it out, you see
    some cars on the map.
  • 0:32 - 0:34
    You pick where you want
    your pickup location to be.
  • 0:34 - 0:36
    At the same time,
    all these guys that
  • 0:36 - 0:39
    you're seeing on the map,
    they have a phone open somewhere.
  • 0:39 - 0:41
    They're logged in waiting for a dispatch.
  • 0:41 - 0:43
    They're all pinging
    into the same datacenter
  • 0:43 - 0:45
    for the city that you both are in.
  • 0:45 - 0:48
    So then what happens is you put
    that pin somewhere,
  • 0:48 - 0:51
    and you get ready to pick up your trip,
    you get ready to request.
  • 0:52 - 0:55
    You hit request and
    that guy's phone starts beeping.
  • 0:55 - 0:58
    Hopefully if everything works out,
    he'll accept that trip.
  • 0:59 - 1:03
    All these things that we're talking about
    here, the request of a trip,
  • 1:03 - 1:06
    the offering it to a driver,
    him accepting it.
  • 1:06 - 1:09
    That's something we call
    a state change transition.
  • 1:09 - 1:12
    From the moment that you start
    requesting the trip,
  • 1:12 - 1:16
    we start creating your trip data
    in the backend datacenter.
  • 1:16 - 1:21
    And that transaction might live
    for anything like 5, 10, 15, 20, 30,
  • 1:21 - 1:24
    however many minutes it takes you
    to take your trip.
  • 1:24 - 1:27
    We have to consistently handle that trip
    to get it all the way through
  • 1:27 - 1:31
    to completion to get you
    where you're going happily.
  • 1:31 - 1:33
    So every time this state change happens,
  • 1:37 - 1:41
    things happen in the world, so next up
    he goes ahead and shows up to you.
  • 1:42 - 1:45
    He arrives, you get in the car,
    he begins the trip.
  • 1:45 - 1:47
    Everything's going fine.
  • 1:49 - 1:52
    So this is of course...
    some of these state changes
  • 1:52 - 1:55
    are more or less important
    to everything that's going on.
  • 1:55 - 1:58
    The begin trip and the end trip are the
    real important ones, of course.
  • 1:58 - 2:00
    The ones that we don't want
    to lose the most.
  • 2:00 - 2:02
    But all these are really important
    to hold onto.
  • 2:02 - 2:07
    So what happens in the case of a
    failure is your trip is gone.
  • 2:07 - 2:11
    You're both back to,
    "oh my god, where'd my trip go?"
  • 2:11 - 2:15
    There you're just seeing empty cars again
    and he's back into an open thing
  • 2:15 - 2:18
    like where you were when you
    started/opened the application
  • 2:18 - 2:18
    in the first place.
  • 2:18 - 2:22
    So this is what used to happen for us
    not too long ago.
  • 2:22 - 2:26
    So how do we fix this,
    how do you fix this in general?
  • 2:26 - 2:28
    So classically you might try
    and say,
  • 2:28 - 2:31
    "Well, let's take all the data in that one
    datacenter and copy it,
  • 2:31 - 2:34
    replicate it to a backup datacenter."
  • 2:34 - 2:37
    This is pretty well understood, classic
    way to solve this problem.
  • 2:37 - 2:40
    You control the active data center
    and the backup data center
  • 2:40 - 2:42
    so it's pretty easy to reason about.
  • 2:42 - 2:44
    People feel comfortable with this scheme.
  • 2:44 - 2:47
    It could work more or less well depending
    on what database you're using.
  • 2:48 - 2:49
    But there's some drawbacks.
  • 2:49 - 2:51
    It gets kinda complicated
    beyond two datacenters.
  • 2:51 - 2:55
    It's always gonna be subject
    to replication lag
  • 2:55 - 2:58
    because the datacenters are separated by
    this thing called the internet,
  • 2:58 - 3:01
    or maybe leased lines
    if you get really into it.
  • 3:02 - 3:07
    So it requires a constant level of high
    bandwidth, especially if you're not using
  • 3:07 - 3:10
    a database well-suited to replication,
    or if you haven't really tuned
  • 3:10 - 3:14
    your data model to keep
    the deltas really small.
  • 3:14 - 3:17
    So, we chose not to go with this route.
  • 3:17 - 3:20
    We instead said, "What if we could solve
    it by saving it down to the driver phone?"
  • 3:20 - 3:24
    Because since we're already in constant
    communication with these driver phones,
  • 3:24 - 3:27
    what if we could just save the data there
    to the driver phone?
  • 3:27 - 3:31
    Then he could failover to any datacenter,
    rather than having to control,
  • 3:31 - 3:34
    "Well here's the backup datacenter for
    this city, the backup datacenter
  • 3:34 - 3:39
    for this city," and then, "Oh, no no, what
    if in a failover, we fail the wrong phones
  • 3:39 - 3:42
    to the wrong datacenter and now we lose
    all their trips again?"
  • 3:42 - 3:43
    That would not be cool.
  • 3:43 - 3:47
    So we really decided to go with this
    mobile implementation approach
  • 3:47 - 3:50
    of saving the trips to the driver phone.
  • 3:50 - 3:53
    But of course, it doesn't come without a
    trade-off, the trade-off here being
  • 3:53 - 3:57
    you've got to implement some kind of a
    replication protocol in the driver phone
  • 3:57 - 4:01
    consistently between whatever
    platforms you support.
  • 4:01 - 4:03
    In our case, iOS and Android.
  • 4:06 - 4:08
    ...But if it could, how would this work?
  • 4:08 - 4:11
    So all these state transitions
    are happening when the phones
  • 4:11 - 4:13
    communicate with our datacenter.
  • 4:13 - 4:18
    So if in response to his request to begin
    trip or arrive, or accept, or any of this,
  • 4:18 - 4:22
    if we could send some data back down to
    his phone and have him keep ahold of it,
  • 4:22 - 4:27
    then in the case of a datacenter failover,
    when his phone pings into
  • 4:27 - 4:31
    the new datacenter, we could request that
    data right back off of his phone, and get
  • 4:31 - 4:34
    you guys right back on your trip with
    maybe only a minimal blip,
  • 4:34 - 4:36
    in the worst case.
  • 4:37 - 4:41
    So at a high level, that's the idea, but
    there are some challenges of course
  • 4:41 - 4:42
    in implementing that.
  • 4:43 - 4:46
    Not all the trip information that we would
    want to save is something we want
  • 4:47 - 4:50
    the driver to have access to, like to be
    able to get your trip and end it
  • 4:50 - 4:54
    consistently in the other datacenter, we'd
    have to have the full rider information.
  • 4:54 - 4:58
    If you're fare splitting with some friends
    it would need to be all rider information.
  • 4:58 - 5:02
    So, there's a lot of things that we need
    to save here to save your trip
  • 5:02 - 5:04
    that we don't want to expose
    to the driver.
  • 5:04 - 5:07
    Also, you have to pretty much assume that
    the driver phones are not really
  • 5:07 - 5:11
    trustable, either because people are doing
    nefarious things with them,
  • 5:11 - 5:13
    or people not the drivers
    have compromised them,
  • 5:13 - 5:16
    or somebody else between you
    and the driver. Who knows?
  • 5:16 - 5:20
    So for most of these reasons, we decided
    we had to go with the crypto approach
  • 5:20 - 5:23
    and encrypt all the data that we store on
    the phones to prevent against tampering
  • 5:23 - 5:25
    and leak of any kind of PII.
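The talk doesn't say which cryptographic scheme Uber used; as a rough illustration of the "encrypt to prevent tampering and PII leaks" idea, here is a toy encrypt-then-MAC sketch using only the Python standard library. The keystream construction is purely illustrative, not production crypto.

```python
import hashlib
import hmac
import os

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream: hash key || nonce || counter (illustrative only).
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt, then append an HMAC tag so tampering is detectable."""
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag

def open_sealed(key: bytes, blob: bytes) -> bytes:
    """Verify the tag before decrypting; reject anything modified."""
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("replication blob was tampered with")
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))
```

The phone only ever stores the opaque sealed blob; only the datacenters hold the key.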
  • 5:27 - 5:32
    And also, toward all these security
    concerns and the simple reliability of
  • 5:32 - 5:36
    interacting with these phones, you want to
    keep the replication protocol as simple
  • 5:36 - 5:40
    as possible to make it easy to reason
    about, easy to debug, remove failure cases
  • 5:40 - 5:43
    and you also want to
    minimize the extra bandwidth.
  • 5:44 - 5:46
    I kinda glossed over the bandwidth impacts
  • 5:46 - 5:48
    when I said backend replication
    isn't really an option.
  • 5:48 - 5:51
    But at least here when you're designing
    this replication protocol,
  • 5:51 - 5:55
    at the application layer you can be
    much more in tune with what data
  • 5:55 - 5:57
    you're serializing, and what
    you're deltafying or not
  • 5:57 - 6:00
    and really mind your bandwidth impact.
  • 6:01 - 6:05
    Especially since it's going over a mobile
    network, this becomes really salient.
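The "deltafying" idea mentioned above can be sketched in a few lines; the field names are hypothetical, but the point is that the application layer knows exactly which fields changed and can send only those over the mobile network.

```python
def deltafy(prev: dict, curr: dict) -> dict:
    """Return only the fields that changed since the previous state
    (key deletions are ignored in this sketch)."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def apply_delta(prev: dict, delta: dict) -> dict:
    """Reconstruct the full state on the receiving side."""
    merged = dict(prev)
    merged.update(delta)
    return merged
```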
  • 6:06 - 6:08
    So, how do you keep it simple?
  • 6:08 - 6:12
    In our case, we decided to go with a very
    simple key value store with all your
  • 6:12 - 6:17
    typical operations: get, set, delete,
    and list all the keys please,
  • 6:17 - 6:21
    with one caveat being you can only
    set a key once so you can't
  • 6:21 - 6:25
    accidentally overwrite a key;
    it eliminates a whole class
  • 6:25 - 6:29
    of weird programming errors or
    out of order message delivery errors
  • 6:29 - 6:32
    that you might have in such a system.
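A minimal sketch of such a set-once key-value store (interface names are illustrative):

```python
class SetOnceKV:
    """Get/set/delete/list-keys with set-once semantics, as described above."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Rejecting overwrites eliminates a class of out-of-order
        # message delivery and accidental-overwrite bugs.
        if key in self._data:
            raise KeyError(f"key {key!r} was already set")
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def list_keys(self):
        return list(self._data)
```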
  • 6:32 - 6:35
    This did however force us to move
    what we'll call "versioning"
  • 6:35 - 6:37
    into the keyspace though.
  • 6:37 - 6:41
    You can't just say, "Oh, I've got a key
    for this trip and please update it
  • 6:41 - 6:42
    to the new version on each state change."
  • 6:42 - 6:47
    No; instead you have to have a key for
    trip and version, and you have to do
  • 6:47 - 6:51
    a set of the new one and a delete of the old one,
    and that at least gives you the nice
  • 6:51 - 6:54
    property that if that fails partway
    through, between the set and the delete,
  • 6:54 - 6:58
    you fail into having two things stored,
    rather than no things stored.
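The set-new-then-delete-old update can be sketched like this (the key format is an assumption); a crash between the two steps leaves two copies stored, never zero:

```python
def advance_trip(store: dict, trip_id: str, old_version: int,
                 new_version: int, blob) -> None:
    """Move a trip to its next state change under set-once keys."""
    new_key = f"{trip_id}:{new_version}"
    old_key = f"{trip_id}:{old_version}"
    assert new_key not in store, "keys may only be set once"
    store[new_key] = blob      # 1. set the new version first
    store.pop(old_key, None)   # 2. only then delete the old version
```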
  • 6:59 - 7:02
    So there are some nice properties to
    keeping a nice simple
  • 7:02 - 7:04
    key-value protocol here.
  • 7:04 - 7:09
    And that makes failover resolution really
    easy because it's simply a matter of,
  • 7:09 - 7:11
    "What keys do you have?
    What trips do you store?
  • 7:11 - 7:13
    What keys do I have in the backend
    datacenter?"
  • 7:13 - 7:18
    Compare those, and come to a resolution
    between those sets of trips.
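That comparison step might look like the following sketch (names are illustrative):

```python
def resolve_failover(phone_keys, dc_keys) -> dict:
    """Compare trip keys held on the phone with keys in the datacenter
    we failed into, and classify them for resolution."""
    phone, dc = set(phone_keys), set(dc_keys)
    return {
        "restore_from_phone": sorted(phone - dc),  # this DC never saw these
        "already_known": sorted(phone & dc),       # nothing to do
        "only_in_dc": sorted(dc - phone),          # possibly stale server state
    }
```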
  • 7:19 - 7:23
    So that's a quick overview
    of how we built this system.
  • 7:23 - 7:26
    My colleague here, Nikunj Aggarwal
    is going to give you a rundown
  • 7:26 - 7:30
    of some more details of how we
    really got the reliability of this system
  • 7:30 - 7:31
    to work at scale.
  • 7:33 - 7:35
    (audience claps)
  • 7:38 - 7:41
    Alright, hi! I'm Nikunj.
  • 7:41 - 7:45
    So we talked about the idea
    and the motivation behind the idea;
  • 7:45 - 7:49
    now let's dive into how we designed such
    a solution, and what kind of tradeoffs
  • 7:49 - 7:52
    we had to make
    while we were doing the design.
  • 7:52 - 7:57
    So first thing we wanted to ensure was
    that the system we built is non-blocking
  • 7:57 - 8:00
    but still provides eventual consistency.
  • 8:00 - 8:05
    So basically...any backend application
    using this system should be able to make
  • 8:05 - 8:08
    further progress, even when
    the system is down.
  • 8:08 - 8:13
    So the only trade-off the application
    should be making is that it may
  • 8:13 - 8:16
    take some time for the data
    to actually be stored on the phone.
  • 8:16 - 8:19
    However, using this system
    should not affect
  • 8:19 - 8:21
    any normal business operations for them.
  • 8:22 - 8:27
    Secondly, we wanted to have an ability
    to move between datacenters without
  • 8:27 - 8:30
    worrying about data already there.
  • 8:30 - 8:33
    So when we failover from
    one datacenter to another,
  • 8:33 - 8:37
    that datacenter still has state in there,
  • 8:37 - 8:42
    and it still has its view
    of active drivers and trips.
  • 8:42 - 8:46
    And no service in that datacenter is aware
    that a failure actually happened.
  • 8:46 - 8:53
    So at some later time, if we fail back to
    the same datacenter, then its view
  • 8:53 - 8:57
    of the drivers and trips may be actually
    different than what the drivers
  • 8:57 - 9:01
    actually have and if we trusted that
    datacenter, then the drivers may get on a
  • 9:01 - 9:04
    stale trip, which is
    a very bad experience.
  • 9:04 - 9:08
    So we need some way to reconcile that data
    between the drivers and the server.
  • 9:08 - 9:14
    Finally, we want to be able to measure
    the success of the system all the time.
  • 9:14 - 9:20
    So the system is only fully exercised
    during a failure, and a datacenter failure
  • 9:20 - 9:23
    is a pretty rare occurrence, and we don't
    want to be in a situation where
  • 9:23 - 9:27
    we detect issues with the system when we
    need it the most.
  • 9:27 - 9:32
    So what we want is an ability to
    constantly be able to measure the success
  • 9:32 - 9:37
    of the system so that we are confident in
    it when a failure actually happens.
  • 9:37 - 9:42
    So in keeping all these issues in mind,
    this is a very high level
  • 9:42 - 9:44
    view of the system.
  • 9:44 - 9:47
    I'm not going to go into details of any
    of the services,
  • 9:47 - 9:49
    since it's a mobile conference.
  • 9:49 - 9:52
    So the first thing that happens is that
    driver makes an update,
  • 9:52 - 9:56
    or as Josh called it, a state change,
    on his app.
  • 9:56 - 9:59
    For example, he may pick up a passenger.
  • 9:59 - 10:02
    Now that update comes as a request to the
    dispatching service.
  • 10:02 - 10:08
    Now the dispatching service, depending on
    the type of request, updates the trip
  • 10:08 - 10:13
    model for that trip, and then it sends the
    update to the replication service.
  • 10:13 - 10:18
    Now the replication service will enqueue
    that request in its own datastore
  • 10:18 - 10:23
    and immediately return a successful
    response to the dispatching service,
  • 10:23 - 10:27
    and then finally the dispatching service
    will update its own datastore
  • 10:27 - 10:30
    and then return a success to mobile.
  • 10:30 - 10:34
    The response may also alter things in some
    other way on the mobile; for example,
  • 10:34 - 10:37
    things might have changed since the last
    time mobile pinged in, for example...
  • 10:39 - 10:42
    ...If it's an UberPool trip,
    then the driver may have to pick up
  • 10:42 - 10:44
    another passenger.
  • 10:44 - 10:46
    Or if the rider entered some destination,
  • 10:46 - 10:50
    we might have to tell
    the driver about that.
  • 10:50 - 10:54
    And in the background, the replication
    service encrypts that data, obviously,
  • 10:54 - 10:57
    since we don't want drivers
    to have access to all that,
  • 10:57 - 11:00
    and then sends it to a messaging service.
  • 11:00 - 11:04
    So the messaging service is something
    that we built as part of this system.
  • 11:05 - 11:09
    It maintains a bidirectional communication
    channel with all drivers
  • 11:09 - 11:11
    on the Uber platform.
  • 11:11 - 11:14
    And this communication channel
    is actually separate from
  • 11:14 - 11:18
    the original request response channel
    which we've been traditionally using
  • 11:18 - 11:20
    at Uber for drivers to communicate
    with the server.
  • 11:20 - 11:24
    So this way, we are not affecting any
    normal business operation
  • 11:24 - 11:26
    due to this service.
  • 11:26 - 11:30
    So the messaging service then sends the
    message to the phone
  • 11:30 - 11:33
    and gets an acknowledgement from them.
  • 11:35 - 11:41
    So from this design, what we have achieved
    is that we've isolated the applications
  • 11:41 - 11:47
    from any replication latencies or failures
    because our replication service
  • 11:47 - 11:53
    returns immediately and the only extra
    thing the application is doing
  • 11:53 - 11:57
    by opting in to this replication strategy
    is making an extra service call
  • 11:57 - 12:02
    to the replication service, which is going
    to be pretty cheap since it's within
  • 12:02 - 12:05
    the same datacenter,
    not traveling through internet.
  • 12:05 - 12:12
    Secondly, now having this separate channel
    gives us the ability to arbitrarily query
  • 12:12 - 12:16
    the states of the phone without affecting
    any normal business operations
  • 12:16 - 12:20
    and we can use that phone as a basic
    key-value store now.
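Put together, the non-blocking flow described above might be sketched like this (service and function names are assumptions, not Uber's actual code):

```python
import queue

replication_queue = queue.Queue()

def handle_state_change(trip_id, update, datastore) -> str:
    """Dispatching-side flow: enqueue replication locally, update the
    dispatch datastore, and return success without waiting on the phone."""
    replication_queue.put((trip_id, update))  # replication service enqueues and acks
    datastore[trip_id] = update               # dispatching service updates its store
    return "ok"                               # success goes back to mobile

def replication_worker(encrypt, send_to_phone) -> None:
    """Background loop: encrypt each blob and push it to the driver's
    phone over the separate messaging channel."""
    while True:
        trip_id, update = replication_queue.get()
        send_to_phone(trip_id, encrypt(update))
        replication_queue.task_done()
```

The extra cost to the application is one service call within the same datacenter, so replication latency never blocks the trip itself.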
  • 12:24 - 12:29
    Next...Okay so now comes the issue of
    moving between datacenters.
  • 12:29 - 12:33
    As I said earlier, when we failover we are
    actually leaving states behind
  • 12:33 - 12:35
    in that datacenter.
  • 12:35 - 12:37
    So how do we deal with stale states?
  • 12:37 - 12:40
    So the first approach we tried
    was actually to do some manual cleanup.
  • 12:40 - 12:44
    So we wrote some cleanup scripts
    and every time you failover
  • 12:44 - 12:49
    from our primary datacenter to our backup
    datacenter, somebody would run that script
  • 12:49 - 12:53
    in the primary and it would go to the
    datastores for the dispatching service
  • 12:53 - 12:55
    and it will clean out
    all the states there.
  • 12:55 - 12:59
    However, this approach had operational
    pain because somebody had to run it.
  • 12:59 - 13:05
    Moreover, we allowed the ability
    to failover per city so you can actually
  • 13:05 - 13:10
    choose to failover specific cities instead
    of the whole world, and in those cases
  • 13:10 - 13:13
    the script started becoming complicated.
  • 13:13 - 13:19
    So then we decided to tweak our design a
    little bit so that we solve this problem.
  • 13:19 - 13:25
    So the first thing we did was...as Josh
    mentioned earlier, the key which is
  • 13:25 - 13:30
    stored on the phone contains the trip
    identifier and the version within it.
  • 13:30 - 13:36
    So the version used to be an incrementing
    number so that we can keep track of
  • 13:36 - 13:37
    any followed progress you're making.
  • 13:37 - 13:42
    However, we changed that to a
    modified vector clock.
  • 13:42 - 13:47
    So using that vector clock, we can now
    compare data on the phone
  • 13:47 - 13:49
    and data on the server.
  • 13:49 - 13:53
    And if there is a mismatch, we can detect any
    causality violations using that.
  • 13:53 - 13:58
    And we can also resolve that using a very
    basic conflict resolution strategy.
  • 13:58 - 14:02
    So this way, we handle any issues
    with ongoing trips.
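A sketch of that comparison; the exact layout of Uber's modified vector clock isn't described, so plain tuples of counters are an assumption here:

```python
def compare_clocks(phone: tuple, server: tuple) -> str:
    """Compare the clock stored on the phone with the server's copy."""
    phone_ahead = all(p >= s for p, s in zip(phone, server))
    server_ahead = all(s >= p for p, s in zip(phone, server))
    if phone_ahead and server_ahead:
        return "equal"
    if phone_ahead:
        return "phone ahead"   # server state is stale; restore from the phone
    if server_ahead:
        return "server ahead"  # phone is behind; server wins
    return "conflict"          # causality violation; run conflict resolution
```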
  • 14:02 - 14:06
    Now next came the
    issue of completed trips.
  • 14:06 - 14:12
    So, traditionally what we'd been doing is,
    when a trip is completed, we will delete
  • 14:12 - 14:14
    all the data about the trip
    from the phone.
  • 14:14 - 14:18
    We did that because we didn't want the
    replication data on the phone
  • 14:18 - 14:20
    to grow unbounded.
  • 14:20 - 14:23
    And once a trip is completed,
    it's probably no longer required
  • 14:23 - 14:25
    for restoration.
  • 14:25 - 14:29
    However, that has the side-effect that
    mobile has no idea now that this trip
  • 14:29 - 14:31
    ever happened.
  • 14:31 - 14:36
    So what will happen is if we failback to
    a datacenter with some stale data
  • 14:36 - 14:40
    about this trip, then you might actually
    end up putting the rider and driver back
  • 14:40 - 14:44
    on that same trip, which is a pretty
    bad experience because he's suddenly
  • 14:44 - 14:47
    now driving somebody he already
    dropped off, and he's probably
  • 14:47 - 14:49
    not gonna be paid for that.
  • 14:49 - 14:55
    So what we did to fix that was...
    on trip completion, we would store
  • 14:55 - 15:01
    a special key on the phone, and the
    version in that key has a flag in it.
  • 15:01 - 15:04
    That's why I called it
    a modified vector clock.
  • 15:04 - 15:08
    So it has a flag that says that this trip
    has already been completed,
  • 15:08 - 15:10
    and we store that on the phone.
  • 15:10 - 15:15
    Now when the replication service sees that
    this driver has this flag for the trip,
  • 15:15 - 15:19
    it can tell the dispatching service that
    "Hey, this trip has already been completed
  • 15:19 - 15:21
    and you should probably delete it."
  • 15:21 - 15:25
    So that way, we handle completed trips.
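The completed-trip tombstone might be sketched like this (the key format and flag name are illustrative):

```python
COMPLETED_FLAG = "completed"

def complete_trip(phone_store: dict, trip_id: str, version: int) -> None:
    """Delete the trip's data blobs and leave a tiny flagged key behind."""
    for key in [k for k in phone_store if k.startswith(trip_id + ":")]:
        del phone_store[key]
    # The tombstone stores no blob, which is why weeks of completed trips
    # can cost about as much as one active trip's data.
    phone_store[f"{trip_id}:{version}:{COMPLETED_FLAG}"] = None

def was_completed(phone_store: dict, trip_id: str) -> bool:
    """On failback, the replication service can use this to tell dispatch
    to delete its stale copy of the trip."""
    return any(k.startswith(trip_id + ":") and k.endswith(COMPLETED_FLAG)
               for k in phone_store)
```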
  • 15:25 - 15:31
    So if you think about it, storing trip
    data is kind of expensive because we have
  • 15:31 - 15:37
    this huge encrypted blob, of JSON maybe.
  • 15:37 - 15:43
    But we can store a large number of
    completed trips
  • 15:43 - 15:46
    because there is almost no data
    associated with them.
  • 15:46 - 15:49
    So we can probably store weeks' worth
    of completed trips in the same amount
  • 15:49 - 15:53
    of memory as we would store one trip data.
  • 15:53 - 15:56
    So that's how we solve stale states.
  • 16:00 - 16:04
    So now, next comes the issue of ensuring
    four nines of reliability.
  • 16:04 - 16:10
    So we decided to exercise the system
    more often than a datacenter failure
  • 16:10 - 16:15
    because we wanted to get confident
    that the system actually works.
  • 16:15 - 16:19
    So our first approach was
    to do manual failovers.
  • 16:19 - 16:24
    So basically what happened was that
    a bunch of us would gather in a room
  • 16:24 - 16:28
    every Monday and then pick a few cities
    and fail them over.
  • 16:28 - 16:31
    And after we failed them over to another
    datacenter, we would see...
  • 16:31 - 16:37
    what the success rate
    for the restoration was, and if
  • 16:37 - 16:41
    there were any failures, then try to look
    at the logs and debug any issues there.
  • 16:41 - 16:44
    However, there were several problems with
    this approach.
  • 16:44 - 16:49
    First, it was very operationally painful.
    So, we had to do this every week.
  • 16:49 - 16:53
    And for a small fraction of trips,
    which did not get restored,
  • 16:53 - 16:56
    we would actually have to do
    fare adjustments
  • 16:56 - 16:58
    for both the rider and the driver.
  • 16:58 - 17:01
    Secondly, it led to a very poor
    customer experience
  • 17:01 - 17:05
    because for that same fraction,
    they were suddenly bumped off the trip
  • 17:05 - 17:08
    and they got totally confused,
    like what happened to them?
  • 17:08 - 17:15
    Thirdly, it had a low coverage because
    we were covering only a few cities.
  • 17:15 - 17:19
    However, in the past we've seen problems
    which affected only a specific city.
  • 17:19 - 17:23
    Maybe because there was a new feature
    enabled in the city
  • 17:23 - 17:25
    which was not global yet.
  • 17:25 - 17:29
    So this approach does not help us
    catch those cases until it's too late.
  • 17:29 - 17:34
    Finally, we had no idea whether the
    backup datacenter can handle the load.
  • 17:34 - 17:39
    So in our current architecture, we have a
    primary datacenter which handles
  • 17:39 - 17:42
    all the requests and then backup
    datacenter which is waiting to handle
  • 17:42 - 17:45
    all those requests in case
    the primary goes down.
  • 17:45 - 17:47
    But how do we know that the
    backup datacenter
  • 17:47 - 17:48
    can handle all those requests?
  • 17:48 - 17:52
    So one way is, maybe you can provision
    the same number of boxes
  • 17:52 - 17:54
    and same type of hardware
    in the backup datacenter.
  • 17:54 - 17:57
    But what if there's a configuration issue?
  • 17:57 - 18:00
    In some of the services,
    we would never catch that.
  • 18:00 - 18:05
    And even if they're exactly the same,
    how do you know that each service
  • 18:05 - 18:09
    in the backup datacenter can handle
    a sudden flood of requests which comes
  • 18:09 - 18:11
    when there is a failure?
  • 18:11 - 18:14
    So we needed some way to fix
    all these problems.
  • 18:16 - 18:19
    So then to understand how to get
    good confidence in the system and
  • 18:19 - 18:23
    to measure it well, we looked at
    the key concepts behind the system
  • 18:23 - 18:25
    which we really wanted to work.
  • 18:25 - 18:30
    So first thing was we wanted to ensure
    that all mutations which are done
  • 18:30 - 18:33
    by the dispatching service are actually
    stored on the phone.
  • 18:33 - 18:38
    So for example, a driver, right after he
    picks up a passenger,
  • 18:38 - 18:40
    he may lose connectivity.
  • 18:40 - 18:43
    And so replication data may not
    be sent to the phone immediately
  • 18:43 - 18:48
    but we want to ensure that the data
    eventually makes it to the phone.
  • 18:48 - 18:52
    Secondly, we wanted to make sure
    that the stored data can actually be used
  • 18:52 - 18:54
    for replication.
  • 18:54 - 18:58
    For example, there may be some
    encryption-decryption issue with the data
  • 18:58 - 19:01
    and the data gets corrupted
    and it's no longer usable.
  • 19:01 - 19:04
    So even if you're storing the data,
    you cannot use it.
  • 19:04 - 19:06
    So there's no point.
  • 19:06 - 19:10
    Or, restoration actually involves
    rehydrating the states
  • 19:10 - 19:13
    within the dispatching service
    using the data.
  • 19:13 - 19:16
    So even if the data is fine,
    if there's any problem
  • 19:16 - 19:21
    during that rehydration process,
    some service behaving weirdly,
  • 19:21 - 19:25
    you would still have no use for that data
    and you would still lose the trip,
  • 19:25 - 19:28
    even though the data is perfectly fine.
  • 19:28 - 19:32
    Finally, as I mentioned earlier,
    we needed a way to figure out
  • 19:32 - 19:35
    whether the backup datacenters
    can handle the load.
  • 19:36 - 19:41
    So to monitor the health of the system
    better, we wrote another service.
  • 19:41 - 19:48
    Every hour it will get a list of all
    active drivers and trips
  • 19:48 - 19:51
    from our dispatching service.
  • 19:51 - 19:57
    And for all those drivers, it will use
    that messaging channel to ask for
  • 19:57 - 19:59
    their replication data.
  • 19:59 - 20:02
    And once it has the replication data,
    it will compare that data
  • 20:02 - 20:05
    with the data which the application
    expects.
  • 20:05 - 20:09
    And doing that, we get a lot of good
    metrics around, like...
  • 20:10 - 20:14
    What percentage of drivers have data
    successfully stored on them?
  • 20:14 - 20:18
    And you can even break down metrics
    by region or by any app versions.
  • 20:18 - 20:21
    So this really helped us
    drill into the problem.
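The hourly comparison could be sketched as follows (the data shapes are assumptions):

```python
def audit_replication(active_drivers, phone_data, expected_data) -> dict:
    """For each (driver, region) pair, compare the replication data pulled
    off the phone over the messaging channel with what the backend
    expects; return a success rate per region so problems can be
    localized by region or app version."""
    by_region = {}
    for driver_id, region in active_drivers:
        totals = by_region.setdefault(region, [0, 0])  # [total, ok]
        totals[0] += 1
        if phone_data.get(driver_id) == expected_data.get(driver_id):
            totals[1] += 1
    return {region: ok / total for region, (total, ok) in by_region.items()}
```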
  • 20:25 - 20:29
    Finally, to know whether the stored data
    can be used for replication,
  • 20:29 - 20:32
    and that the backup datacenter
    can handle the load,
  • 20:32 - 20:36
    what we do is we use all the data
    which we got in the previous step
  • 20:36 - 20:39
    and we send that to our backup datacenter.
  • 20:39 - 20:42
    And within the backup datacenter
    we perform what we call
  • 20:42 - 20:44
    a shadow restoration.
  • 20:44 - 20:50
    And since there is nobody else making any
    changes in that backup datacenter,
  • 20:50 - 20:54
    after the restoration is completed,
    we can just query the dispatching service
  • 20:54 - 20:56
    in the backup datacenter.
  • 20:56 - 21:01
    And say, "Hey, how many active riders,
    drivers, and trips do you have?"
  • 21:01 - 21:04
    And we can compare that number
    with the number we got
  • 21:04 - 21:06
    in our snapshot
    from the primary datacenter.
  • 21:06 - 21:11
    And using that, we get really valuable
    information around what's our success rate
  • 21:11 - 21:15
    and we can do similar breakdowns by
    different parameters
  • 21:15 - 21:17
    like region or app version.
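Scoring the shadow restoration reduces to comparing entity counts against the primary's snapshot; a minimal sketch:

```python
def shadow_restoration_report(primary_snapshot: dict,
                              restored_counts: dict) -> dict:
    """Restoration success per entity type (riders, drivers, trips):
    the fraction of the primary's snapshot that was rebuilt in the
    backup datacenter from phone data alone."""
    return {
        entity: restored_counts.get(entity, 0) / expected
        for entity, expected in primary_snapshot.items()
        if expected  # skip empty categories to avoid division by zero
    }
```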
  • 21:17 - 21:21
    Finally, we also get metrics around
    how well the backup datacenter did.
  • 21:21 - 21:25
    So did we subject it to a lot of load,
    or can it handle the traffic
  • 21:25 - 21:27
    when there is a real failure?
  • 21:27 - 21:32
    Also, any configuration issue
    in the backup datacenter
  • 21:32 - 21:35
    can be easily caught by this approach.
  • 21:35 - 21:40
    So using this service, we are
    constantly testing the system
  • 21:40 - 21:45
    and making sure we have confidence in it
    and can use it during a failure.
  • 21:45 - 21:48
    Cause if there's no confidence
    in the system, then it's pointless.
  • 21:51 - 21:54
    So yeah, that was the idea
    behind the system
  • 21:54 - 21:57
    and how we implemented it.
  • 21:57 - 22:01
    I did not get a chance to go into
    different [inaudible] of detail,
  • 22:01 - 22:05
    but if you guys have any questions,
    you can always reach out to us
  • 22:05 - 22:07
    during the office hours.
  • 22:07 - 22:09
    So thanks guys for coming
    and listening to us.
  • 22:10 - 22:14
    (audience claps)
Video Language:
English
Duration:
22:19