(audience claps)

(Josh) OK. Alright everybody, so let's dive in.

So let's talk about how Uber trips even happen, before we get into the nitty gritty of how we save them from the clutches of the datacenter failover.

You might have heard we're all about connecting riders and drivers. This is what it looks like. You've probably at least seen the rider app: you get it out, you see some cars on the map, and you pick where you want your pickup location. At the same time, all these guys that you're seeing on the map have a phone open somewhere. They're logged in, waiting for a dispatch, and they're all pinging into the same datacenter for the city that you're both in.

So then what happens is you put that pin somewhere and you get ready to request your trip. You hit request, and that guy's phone starts beeping. Hopefully, if everything works out, he'll accept that trip.

All these things that we're talking about here, the request of a trip, the offering of it to a driver, him accepting it, are something we call a state change transition. From the moment that you start requesting the trip, we start creating your trip data in the backend datacenter. And that transaction might live for anything like 5, 10, 15, 20, 30, however many minutes it takes you to take your trip, and we have to consistently handle that trip all the way through to completion, to get you where you're going happily.

So every time one of these state changes happens, things happen in the world. Next up, he goes ahead and shows up to you. He arrives, you get in the car, he begins the trip. Everything's going fine.

Now of course, some of these state changes are more or less important to everything that's going on. The begin trip and the end trip are the really important ones, the ones we don't want to lose the most, but all of these are really important to hold onto.

So what happens in the case of a failure is that your trip is gone. You're both back to, "OMG, where'd my trip go?" You're just seeing empty cars again, and he's back to an open app, like where you both were when you opened the application in the first place. So this is what used to happen for us not too long ago.
So how do we fix this? How do you fix this in general? Classically you might try and say, "Well, let's take all the data in that one datacenter and copy it, replicate it, to a backup datacenter." This is a pretty well understood, classic way to solve this problem. You control the active datacenter and the backup datacenter, so it's pretty easy to reason about, and people feel comfortable with this scheme. It can work more or less well depending on what database you're using.

But there are some drawbacks. It gets kind of complicated beyond two datacenters. It's always going to be subject to replication lag, because the datacenters are separated by this thing called the internet, or maybe leased lines if you get really into it. And it requires a constant level of high bandwidth, especially if you're not using a database well suited to replication, or if you haven't really tuned your data model to keep the deltas small.

So we chose not to go with that route. We instead said, "What if we could push this all the way down to the driver?" Since we're already in constant communication with these driver phones, what if we could just save the data there, to the driver phone? Then he could fail over to any datacenter, rather than us having to control, "Well, here's the backup datacenter for this city, and the backup datacenter for that city," and then, "Oh no, what if in a failover we fail the wrong phones over to the wrong datacenter and now we lose all their trips again?" That would not be cool. So we really decided to go with this mobile implementation approach of saving the trips to the driver phone.

But of course, it doesn't come without a trade-off, the trade-off here being that you've got to implement some kind of replication protocol in the driver phone, consistently across whatever platforms you support. In our case, iOS and Android.

...But if we could, how would this work? So all these state transitions are happening when the phones communicate with our datacenter.
So if, in response to his request to begin trip, or arrive, or accept, or any of this, we could send some data back down to his phone and have him keep hold of it, then in the case of a datacenter failover, when his phone pings into the new datacenter, we could request that data right back off of his phone and get you guys right back on your trip, with maybe only a minimal blip in the worst case.

So at a high level, that's the idea, but there are some challenges, of course, in implementing it. Not all the trip information that we would want to save is something we want the driver to have access to. To be able to restore your trip in the other datacenter, we'd have to have the full rider information, and if you're fare splitting with some friends, it would need to be all of the riders' information. So there are a lot of things that we need to save here, to save your trip, that we don't want to expose to the driver.

Also, you pretty much have to assume that the driver phones can't really be trusted, either because people are doing nefarious things with them, or because people other than the drivers have compromised them, or somebody else between you and the driver has. Who knows? So for all of these reasons, we decided we had to go with the crypto approach and encrypt all the data that we store on the phones, to protect against tampering and against leaking any kind of PII.

And also, toward all these security goals, and for the simple reliability of interacting with these phones, you want to keep the replication protocol as simple as possible, to make it easy to [inaudible], easy to debug, to remove failure cases, and you also want to minimize the extra bandwidth. I kind of glossed over the bandwidth impact when I said backend replication isn't really an option. But at least here, when you're designing this replication protocol at the application layer, you can be much more in tune with what data you're serializing and what you're deltafying or not, and really mind your bandwidth impact. Especially since it's going over a mobile network, this becomes really salient.
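The talk doesn't say which cipher or library was used, so here is only a minimal sketch, assuming an authenticated symmetric scheme (Fernet from the Python cryptography package): the backend serializes and encrypts the trip state before replicating it, so the phone only ever holds an opaque, tamper-evident blob, and the key never leaves the datacenter.

```python
# Minimal sketch, assuming Fernet authenticated encryption; not Uber's actual scheme.
import json
from cryptography.fernet import Fernet, InvalidToken

SECRET_KEY = Fernet.generate_key()   # in practice the key stays in the datacenter
cipher = Fernet(SECRET_KEY)

def seal_trip_state(trip_state: dict) -> bytes:
    """Serialize and encrypt trip state before sending it down to the driver phone."""
    return cipher.encrypt(json.dumps(trip_state).encode("utf-8"))

def open_trip_state(blob: bytes) -> dict:
    """Decrypt a blob fetched back from a phone during failover.
    Raises InvalidToken if the blob was tampered with on the phone."""
    return json.loads(cipher.decrypt(blob).decode("utf-8"))
```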
So, how do you keep it simple? In our case, we decided to go with a very simple key-value store with all your typical operations: get, set, delete, and "list all the keys, please," with one caveat: you can only set a key once. So you can't accidentally overwrite a key, and that eliminates a whole class of weird programming errors and out-of-order message delivery errors you might have in such a system.

This did, however, force us to move what we'll call versioning into the keyspace. You can't just say, "Oh, I've got a key for this trip, please update it to the new version on each state change." No; instead you have a key for trip-and-version, and you do a set of the new one followed by a delete of the old one. That at least gives you the nice property that if it fails partway through, between the set and the delete, you fail into having two things stored rather than nothing stored. So there are some nice properties to keeping a nice simple key-value protocol here.

And that makes failover resolution really easy, because it's simply a matter of, "What keys do you have? What trips do you store? What keys do I have in the backend datacenter?" Compare those, and come to a resolution between those sets of trips. So that's a quick overview of how we built this system.
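As a rough illustration of that protocol (the names here are invented, not Uber's actual code), a set-once key-value store with versioned keys might look like this:

```python
# Sketch of the set-once key-value protocol described above; illustrative names only.
# Keys carry the trip id and version, e.g. "trip:<uuid>:<version>", and a key can
# never be overwritten once set.

class KeyAlreadySetError(Exception):
    pass

class SetOnceStore:
    def __init__(self):
        self._data = {}

    def set(self, key: str, value: bytes) -> None:
        if key in self._data:            # a key may only ever be set once
            raise KeyAlreadySetError(key)
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def list_keys(self) -> list:
        return list(self._data)

def advance_trip_version(store: SetOnceStore, trip_id: str,
                         old_version: int, new_version: int, blob: bytes) -> None:
    """On each state change: set the new versioned key first, then delete the old one.
    Failing between the two steps leaves two copies stored, never zero."""
    store.set(f"trip:{trip_id}:{new_version}", blob)
    store.delete(f"trip:{trip_id}:{old_version}")
```

Failover resolution is then just a comparison of the keys the phone reports against the keys the datacenter knows about.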
...Nikunj Aggarwal is going to give you a rundown of some more details of how we really got the reliability of this system to work at scale.

(audience claps)

(Nikunj) Alright, hi! I'm Nikunj. So we talked about the idea and the motivation behind it, and now let's dive into how we designed such a solution, and what kind of trade-offs we had to make while we were doing the design.

The first thing we wanted to ensure was that the system we built is non-blocking but still provides eventual consistency. Basically, any backend application using this system should be able to make [inaudible] progress, even when the system is down. The only trade-off the application should be making is that it may take some time for the data to actually be stored on the phone. However, using this system should not affect any of its normal business operations.

Secondly, we wanted the ability to move between datacenters without worrying about the data already there. When we fail over from one datacenter to another, that datacenter still has state in it, and it still has its view of active drivers and trips, and no service in that datacenter is aware that a failure actually happened. So at some later time, if we fail back to the same datacenter, its view of the drivers and trips may actually be different from what the drivers actually have, and if we trusted that datacenter, then the drivers may get an [inaudible], which is a very bad experience. So we need some way to reconcile that data between the drivers and the server.

Finally, we want to be able to measure the success of the system all the time. The system is only fully exercised during a failure, and a datacenter failure is a pretty rare occurrence, and we don't want to be in a situation where we detect issues with the system when we need it the most. So what we want is the ability to constantly measure the success of the system, so that we are confident in it when a failure actually happens.

Keeping all these issues in mind, this is a very high-level view of the system. I'm not going to go into the details of any of the services, since it's a mobile conference.

So the first thing that happens is that the driver makes an update, or as Josh called it, a state change, on his app. For example, he may pick up a passenger. That update comes as a request to the dispatching service. The dispatching service, depending on the type of request, updates the trip model for that trip, and then sends the update to the replication service. The replication service will enqueue that request in its own datastore and immediately return a successful response to the dispatching service, and then finally the dispatching service will update its own datastore and return a success to mobile. It may also return something else to mobile; for example, things might have changed since the last time mobile pinged in. If it's an UberPool trip, the driver may have to pick up another passenger, or if the rider entered a destination, we might have to tell the driver about that.

And in the background, the replication service encrypts that data, obviously, since we don't want drivers to have access to all of it, and then sends it to a messaging service.
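To make that flow concrete, here is a minimal sketch of the non-blocking hand-off. All service and helper names are invented (it reuses the hypothetical seal_trip_state from the encryption sketch), and success from the replication service only means "accepted for replication", not "already on the phone":

```python
# Sketch of the non-blocking hand-off described above; all names are illustrative.
import queue
import threading

class ReplicationService:
    def __init__(self, messaging_service):
        self._queue = queue.Queue()        # the service's own datastore, simplified
        self._messaging = messaging_service
        threading.Thread(target=self._drain, daemon=True).start()

    def replicate(self, driver_id: str, key: str, trip_state: dict) -> bool:
        """Synchronous call from the dispatching service; returns immediately."""
        self._queue.put((driver_id, key, trip_state))
        return True

    def _drain(self) -> None:
        """Background: encrypt queued trip state and push it over the messaging channel."""
        while True:
            driver_id, key, state = self._queue.get()
            blob = seal_trip_state(state)               # hypothetical helper from above
            self._messaging.send(driver_id, key, blob)  # blocks until the phone acks

class DispatchingService:
    def __init__(self, replication: ReplicationService):
        self._datastore = {}
        self._replication = replication

    def handle_state_change(self, driver_id: str, trip_id: str, version: int, state: dict):
        key = f"trip:{trip_id}:{version}"
        self._replication.replicate(driver_id, key, state)  # cheap, same-datacenter call
        self._datastore[key] = state                         # dispatching's own datastore
        return {"status": "ok"}                              # reply to the driver app
```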
The messaging service is something we built as part of this system. It maintains a bidirectional communication channel with all drivers on the Uber platform, and that channel is actually separate from the original request-response channel we've traditionally been using at Uber for drivers to communicate with the server. This way, we are not affecting any normal business operation with this service. The messaging service then sends the message to the phone and gets an acknowledgement back.

So with this design, what we have achieved is that we've isolated the applications from any replication latencies or failures, because our replication service returns immediately, and the only extra thing an application does by opting in to this replication strategy is make an extra service call to the replication service, which is pretty cheap since it stays within the same datacenter rather than traveling over the internet. Secondly, having this separate channel gives us the ability to arbitrarily query the state of the phone without affecting any normal business operations, and we can use that phone as a basic key-value store now.
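On the phone side, the real implementations live in the iOS and Android apps; this Python sketch with invented message shapes only shows the shape of the protocol: the driver app stores whatever opaque blobs the backend sends and answers queries over that same channel.

```python
# Illustrative phone-side handler; reuses SetOnceStore from the key-value sketch above.
class DriverPhoneReplica:
    def __init__(self):
        self._store = SetOnceStore()

    def on_message(self, msg: dict) -> dict:
        """Handle one message from the messaging service and return an ack/response."""
        op = msg["op"]
        if op == "set":
            self._store.set(msg["key"], msg["blob"])
            return {"ack": True}
        if op == "delete":
            self._store.delete(msg["key"])
            return {"ack": True}
        if op == "list_keys":              # used for failover resolution and health checks
            return {"keys": self._store.list_keys()}
        if op == "get":
            return {"blob": self._store.get(msg["key"])}
        return {"error": "unknown op"}
```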
Next... OK, so now comes the issue of moving between datacenters. As I said earlier, when we fail over, we are actually leaving state behind in that datacenter. So how do we deal with that stale state?

The first approach we tried was to do some manual cleanup. We wrote some cleanup scripts, and every time we failed over from our primary datacenter to our backup datacenter, somebody would run that script in the primary, and it would go to the datastores for the dispatching service and clean out all the state there. However, this approach had operational pain, because somebody had to run it. Moreover, we allowed the ability to fail over per city, so you can actually choose to fail over specific cities instead of the whole world, and in those cases the script started becoming complicated.

So then we decided to tweak our design a little bit to solve this problem. The first thing we did was... as Josh mentioned earlier, the key which is stored on the phone contains the trip identifier and the version within it. The version used to be an incrementing number, so that we can keep track of any forward progress you're making. However, we changed that to a modified vector clock. Using that vector clock, we can now compare data on the phone and data on the server, and if there is a mismatch, we can detect any causality [inaudible] using that. And we can also resolve it using a very basic conflict resolution strategy. So this way, we handle any issues with ongoing trips.

Next came the issue of completed trips. Traditionally, what we'd been doing is, when a trip is completed, we delete all the data about the trip from the phone. We did that because we didn't want the replication data on the phone to grow unbounded, and once a trip is completed, it's probably no longer required for restoration. However, that has the side effect that mobile now has no idea that the trip ever happened. So what will happen is, if we fail back to a datacenter with some stale data about this trip, then you might actually end up putting the driver back on that same trip, which is a pretty bad experience, because he's suddenly driving somebody he already dropped off, and he's probably not going to get paid for that.

So what we did to fix that was, on trip completion, we store a special key on the phone, and the version in that key has a flag in it; that's why I call it a modified vector clock. It has a flag that says this trip has already been completed, and we store that on the phone. Now when the replication service sees that this driver has this flag for the trip, it can tell the dispatching service, "Hey, this trip has already been completed, and you should probably delete it." So that way, we handle completed trips.

And if you think about it, storing trip data is kind of expensive, because it's this huge encrypted blob of JSON, maybe. But we can store a lot of completed trips, because there is no trip data associated with them, just the key. So we can probably store weeks' worth of completed trips in the same amount of memory as we would store one trip's data. So that's how we solve stale state.
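The talk doesn't spell out the format of that "modified vector clock", so the following is only an illustrative guess at the resolution logic: a per-trip version carries a counter plus a completed flag, and the completed flag tells a stale datacenter to delete the trip rather than restore it.

```python
# Illustrative sketch only -- the exact format of the "modified vector clock" is not
# described in the talk. Here a version is (counter, completed): higher counters
# supersede lower ones, and a completed flag supersedes everything.
from typing import Optional, Tuple

Version = Tuple[int, bool]   # (monotonic counter, trip_completed)

def parse_key(key: str) -> Tuple[str, Version]:
    """Assume keys look like 'trip:<trip_id>:<counter>:<completed 0/1>'."""
    _, trip_id, counter, completed = key.split(":")
    return trip_id, (int(counter), completed == "1")

def resolve(phone_version: Optional[Version], server_version: Optional[Version]) -> str:
    """Decide what a (possibly stale) datacenter should do for one trip."""
    if phone_version is None:
        return "keep_server_state"      # nothing replicated yet; trust the server
    if phone_version[1]:                # completed flag set on the phone
        return "delete_trip"            # the trip already ended; don't resurrect it
    if server_version is None or phone_version[0] > server_version[0]:
        return "restore_from_phone"     # phone has newer state; rehydrate from it
    return "keep_server_state"
```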
So now comes the issue of ensuring four nines of reliability. We decided to exercise the system more often than an actual datacenter failure, because we wanted to be confident that the system actually works. Our first approach was to do manual failovers. Basically, a bunch of us would gather in a room every Monday, pick a few cities, and fail them over. And after we failed them over to one of the datacenters, we'd see what the success rate for the restoration was, and if there were any failures, we'd try to look at the logs and debug any issues there.

However, there were several problems with this approach. First, it was very operationally painful; we had to do this every week, and for the small fraction of trips which did not get restored, we would actually have to do fare adjustments for both the rider and the driver. Secondly, it led to a very poor customer experience, because that same fraction of people were suddenly bumped off their trip and got totally confused about what happened to them. Thirdly, it had low coverage, because we were covering only a few cities. But in the past we've seen problems which affected only a specific city, maybe because there was a new feature enabled in that city which was not [inaudible] yet. So this approach does not help us catch those cases until it's too late.

Finally, we had no idea whether the backup datacenter could handle the load. In our current architecture, we have a primary datacenter which handles all the requests, and then a backup datacenter which is waiting to handle all those requests in case the primary goes down. But how do we know that the backup datacenter can handle all those requests? One way is maybe to provision the same number of boxes and the same type of hardware in the backup datacenter. But what if there's a configuration issue in some of the services? We would never catch that. And even if they're exactly the same, how do you know that each service in the backup datacenter can handle the sudden flood of requests which comes when there is a failure? So we needed some way to fix all these problems.

So then, to understand how to get good confidence in the system and measure it well, we looked at the key things we really needed the system to do. The first thing was to ensure that all mutations which are done by the dispatching service actually end up stored on the phone. For example, a driver, right after he picks up a passenger, may lose connectivity.
And so the replication data may not be sent to the phone immediately, but we want to ensure that the data eventually makes it to the phone. Secondly, we wanted to make sure that the stored data can actually be used for restoration. For example, there may be some encryption or decryption issue with the data, and the data gets corrupted and is no longer usable. So even if you're storing the data, you cannot use it, so there's no point. Also, restoration actually involves rehydrating the state within the dispatching service using that data. So even if the data is fine, if there's any problem during that rehydration process, some service behaving [inaudible], you would still have no use for that data, and you would still lose the trip, even though the data is perfectly fine. Finally, as I mentioned earlier, we needed a way to figure out whether the backup datacenters can handle the load.

So to monitor the health of the system better, we wrote another service. Every hour, it gets a list of all active drivers and trips from our dispatching service, and for all those drivers, it uses that messaging channel to ask for their replication data. Once it has the replication data, it compares it with the data which the application expects. From that, we get a lot of good metrics, like what percentage of drivers have data successfully stored on them, and we can even break the metrics down by region or by app version. So this really helped us drill into the problem.
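Here is a minimal sketch of that hourly comparison; the service interfaces and metric names are assumptions, and it reuses the hypothetical list_keys query from the phone-side sketch above.

```python
# Illustrative hourly health check: for every active driver, ask the phone which trip
# keys it holds and compare that against what the dispatching service expects.
from collections import Counter

def run_health_check(dispatching_service, messaging_service) -> Counter:
    snapshot = dispatching_service.active_drivers_and_trips()   # {driver_id: expected_keys}
    results = Counter(total_drivers=len(snapshot))
    for driver_id, expected_keys in snapshot.items():
        reply = messaging_service.request(driver_id, {"op": "list_keys"})
        stored = set(reply.get("keys", []))
        if set(expected_keys) <= stored:
            results["drivers_fully_replicated"] += 1
        else:
            results["drivers_missing_data"] += 1   # break down further by region / app version
    return results
```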
Finally, to know whether the stored data can actually be used for restoration, and whether the backup datacenter can handle the load, what we do is take all the data which we got in the previous step and send it to our backup datacenter. Within the backup datacenter we perform what we call a shadow restoration. And since there is nobody else making changes in that backup datacenter, after the restoration is completed we can just query the dispatching service in the backup datacenter and say, "Hey, how many active riders, drivers, and trips do you have?" And we can compare those numbers with the numbers we got in our snapshot from the primary datacenter.

Using that, we get really valuable information about our success rate, and we can do similar breakdowns by different parameters like region or app version. Finally, we also get metrics around how well the backup datacenter did. Did we subject it to a lot of load? Can it handle the traffic when there is a real failure? Also, any configuration issue in the backup datacenter can easily be caught by this approach. So using this service, we are constantly testing the system and making sure we have confidence in it and can use it during a failure. Because if there's no confidence in the system, then it's pointless.

So yeah, that was the idea behind the system and how we implemented it. I did not get a chance to go into different [inaudible] of detail, but if you guys have any questions, you can always reach out to us during the office hours. So thanks, guys, for coming and listening to us.

(audience claps)