(audience claps)

(Josh) OK. Alright everybody, so let's dive in.

So let's talk about how Uber trips even happen, before we get into the nitty gritty of how we save them from the clutches of the datacenter failover.

You might have heard we're all about connecting riders and drivers. This is what it looks like. You've probably at least seen the rider app: you get it out, you see some cars on the map, and you pick where you want your pickup location. At the same time, all these guys that you're seeing on the map have a phone open somewhere. They're logged in, waiting for a dispatch, and they're all pinging into the same datacenter for the city that you're both in.

So then what happens is you put that pin somewhere and you get ready to request your trip. You hit request, and that guy's phone starts beeping. Hopefully, if everything works out, he'll accept that trip.

All these things that we're talking about here, the request of a trip, the offering of it to a driver, him accepting it, are something we call a state change transition. From the moment that you start requesting the trip, we start creating your trip data in the backend datacenter. And that transaction might live for anything like 5, 10, 15, 20, 30, however many minutes it takes you to take your trip, and we have to consistently handle that trip all the way through to completion, to get you where you're going happily.

So every time one of these state changes happens, things happen in the world. Next up, he goes ahead and shows up to you. He arrives, you get in the car, he begins the trip. Everything's going fine.

Now of course, some of these state changes are more or less important to everything that's going on. The begin trip and the end trip are the really important ones, the ones we don't want to lose the most, but all of these are really important to hold onto.

So what happens in the case of a failure is that your trip is gone. You're both back to, "OMG, where'd my trip go?" You're just seeing empty cars again, and he's back to an open app, like where you both were when you opened the application in the first place. So this is what used to happen for us not too long ago.
So how do we fix this? How do you fix this in general? Classically you might try and say, "Well, let's take all the data in that one datacenter and copy it, replicate it, to a backup datacenter." This is a pretty well understood, classic way to solve this problem. You control the active datacenter and the backup datacenter, so it's pretty easy to reason about, and people feel comfortable with this scheme. It can work more or less well depending on what database you're using.

But there are some drawbacks. It gets kind of complicated beyond two datacenters. It's always going to be subject to replication lag, because the datacenters are separated by this thing called the internet, or maybe leased lines if you get really into it. And it requires a constant level of high bandwidth, especially if you're not using a database well suited to replication, or if you haven't really tuned your data model to keep the deltas small.

So we chose not to go with that route. We instead said, "What if we could push this all the way down to the driver?" Since we're already in constant communication with these driver phones, what if we could just save the data there, to the driver phone? Then he could fail over to any datacenter, rather than us having to control, "Well, here's the backup datacenter for this city, and the backup datacenter for that city," and then, "Oh no, what if in a failover we fail the wrong phones over to the wrong datacenter and now we lose all their trips again?" That would not be cool. So we really decided to go with this mobile implementation approach of saving the trips to the driver phone.

But of course, it doesn't come without a trade-off, the trade-off here being that you've got to implement some kind of replication protocol in the driver phone, consistently across whatever platforms you support. In our case, iOS and Android.

...But if we could, how would this work? So all these state transitions are happening when the phones communicate with our datacenter.
So if, in response to his request to begin trip, or arrive, or accept, or any of this, we could send some data back down to his phone and have him keep hold of it, then in the case of a datacenter failover, when his phone pings into the new datacenter, we could request that data right back off of his phone and get you guys right back on your trip, with maybe only a minimal blip in the worst case.

So at a high level, that's the idea, but there are some challenges, of course, in implementing it. Not all the trip information that we would want to save is something we want the driver to have access to. To be able to restore your trip in the other datacenter, we'd have to have the full rider information, and if you're fare splitting with some friends, it would need to be all of the riders' information. So there are a lot of things that we need to save here, to save your trip, that we don't want to expose to the driver.

Also, you pretty much have to assume that the driver phones can't really be trusted, either because people are doing nefarious things with them, or because people other than the drivers have compromised them, or somebody else between you and the driver has. Who knows? So for all of these reasons, we decided we had to go with the crypto approach and encrypt all the data that we store on the phones, to protect against tampering and against leaking any kind of PII.

And also, toward all these security goals, and for the simple reliability of interacting with these phones, you want to keep the replication protocol as simple as possible, to make it easy to [inaudible], easy to debug, to remove failure cases, and you also want to minimize the extra bandwidth. I kind of glossed over the bandwidth impact when I said backend replication isn't really an option. But at least here, when you're designing this replication protocol at the application layer, you can be much more in tune with what data you're serializing and what you're deltafying or not, and really mind your bandwidth impact. Especially since it's going over a mobile network, this becomes really salient.
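The talk doesn't say which cipher or library was used, so here is only a minimal sketch, assuming an authenticated symmetric scheme (Fernet from the Python cryptography package): the backend serializes and encrypts the trip state before replicating it, so the phone only ever holds an opaque, tamper-evident blob, and the key never leaves the datacenter.

```python
# Minimal sketch, assuming Fernet authenticated encryption; not Uber's actual scheme.
import json
from cryptography.fernet import Fernet, InvalidToken

SECRET_KEY = Fernet.generate_key()   # in practice the key stays in the datacenter
cipher = Fernet(SECRET_KEY)

def seal_trip_state(trip_state: dict) -> bytes:
    """Serialize and encrypt trip state before sending it down to the driver phone."""
    return cipher.encrypt(json.dumps(trip_state).encode("utf-8"))

def open_trip_state(blob: bytes) -> dict:
    """Decrypt a blob fetched back from a phone during failover.
    Raises InvalidToken if the blob was tampered with on the phone."""
    return json.loads(cipher.decrypt(blob).decode("utf-8"))
```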
So, how do you keep it simple? In our case, we decided to go with a very simple key-value store with all your typical operations: get, set, delete, and "list all the keys, please," with one caveat: you can only set a key once. So you can't accidentally overwrite a key, and that eliminates a whole class of weird programming errors and out-of-order message delivery errors you might have in such a system.

This did, however, force us to move what we'll call versioning into the keyspace. You can't just say, "Oh, I've got a key for this trip, please update it to the new version on each state change." No; instead you have a key for trip-and-version, and you do a set of the new one followed by a delete of the old one. That at least gives you the nice property that if it fails partway through, between the set and the delete, you fail into having two things stored rather than nothing stored. So there are some nice properties to keeping a nice simple key-value protocol here.

And that makes failover resolution really easy, because it's simply a matter of, "What keys do you have? What trips do you store? What keys do I have in the backend datacenter?" Compare those, and come to a resolution between those sets of trips. So that's a quick overview of how we built this system.
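As a rough illustration of that protocol (the names here are invented, not Uber's actual code), a set-once key-value store with versioned keys might look like this:

```python
# Sketch of the set-once key-value protocol described above; illustrative names only.
# Keys carry the trip id and version, e.g. "trip:<uuid>:<version>", and a key can
# never be overwritten once set.

class KeyAlreadySetError(Exception):
    pass

class SetOnceStore:
    def __init__(self):
        self._data = {}

    def set(self, key: str, value: bytes) -> None:
        if key in self._data:            # a key may only ever be set once
            raise KeyAlreadySetError(key)
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def list_keys(self) -> list:
        return list(self._data)

def advance_trip_version(store: SetOnceStore, trip_id: str,
                         old_version: int, new_version: int, blob: bytes) -> None:
    """On each state change: set the new versioned key first, then delete the old one.
    Failing between the two steps leaves two copies stored, never zero."""
    store.set(f"trip:{trip_id}:{new_version}", blob)
    store.delete(f"trip:{trip_id}:{old_version}")
```

Failover resolution is then just a comparison of the keys the phone reports against the keys the datacenter knows about.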
...Nikunj Aggarwal is going to give you a rundown of some more details of how we really got the reliability of this system to work at scale.

(audience claps)

(Nikunj) Alright, hi! I'm Nikunj. So we talked about the idea and the motivation behind it, and now let's dive into how we designed such a solution, and what kind of trade-offs we had to make while we were doing the design.

The first thing we wanted to ensure was that the system we built is non-blocking but still provides eventual consistency. Basically, any backend application using this system should be able to make [inaudible] progress, even when the system is down. The only trade-off the application should be making is that it may take some time for the data to actually be stored on the phone. However, using this system should not affect any of its normal business operations.

Secondly, we wanted the ability to move between datacenters without worrying about the data already there. When we fail over from one datacenter to another, that datacenter still has state in it, and it still has its view of active drivers and trips, and no service in that datacenter is aware that a failure actually happened. So at some later time, if we fail back to the same datacenter, its view of the drivers and trips may actually be different from what the drivers actually have, and if we trusted that datacenter, then the drivers may get an [inaudible], which is a very bad experience. So we need some way to reconcile that data between the drivers and the server.

Finally, we want to be able to measure the success of the system all the time. The system is only fully exercised during a failure, and a datacenter failure is a pretty rare occurrence, and we don't want to be in a situation where we detect issues with the system when we need it the most. So what we want is the ability to constantly measure the success of the system, so that we are confident in it when a failure actually happens.

Keeping all these issues in mind, this is a very high-level view of the system. I'm not going to go into the details of any of the services, since it's a mobile conference.

So the first thing that happens is that the driver makes an update, or as Josh called it, a state change, on his app. For example, he may pick up a passenger. That update comes as a request to the dispatching service. The dispatching service, depending on the type of request, updates the trip model for that trip, and then sends the update to the replication service. The replication service will enqueue that request in its own datastore and immediately return a successful response to the dispatching service, and then finally the dispatching service will update its own datastore and return a success to mobile. It may also return something else to mobile; for example, things might have changed since the last time mobile pinged in. If it's an UberPool trip, the driver may have to pick up another passenger, or if the rider entered a destination, we might have to tell the driver about that.

And in the background, the replication service encrypts that data, obviously, since we don't want drivers to have access to all of it, and then sends it to a messaging service.
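To make that flow concrete, here is a minimal sketch of the non-blocking hand-off. All service and helper names are invented (it reuses the hypothetical seal_trip_state from the encryption sketch), and success from the replication service only means "accepted for replication", not "already on the phone":

```python
# Sketch of the non-blocking hand-off described above; all names are illustrative.
import queue
import threading

class ReplicationService:
    def __init__(self, messaging_service):
        self._queue = queue.Queue()        # the service's own datastore, simplified
        self._messaging = messaging_service
        threading.Thread(target=self._drain, daemon=True).start()

    def replicate(self, driver_id: str, key: str, trip_state: dict) -> bool:
        """Synchronous call from the dispatching service; returns immediately."""
        self._queue.put((driver_id, key, trip_state))
        return True

    def _drain(self) -> None:
        """Background: encrypt queued trip state and push it over the messaging channel."""
        while True:
            driver_id, key, state = self._queue.get()
            blob = seal_trip_state(state)               # hypothetical helper from above
            self._messaging.send(driver_id, key, blob)  # blocks until the phone acks

class DispatchingService:
    def __init__(self, replication: ReplicationService):
        self._datastore = {}
        self._replication = replication

    def handle_state_change(self, driver_id: str, trip_id: str, version: int, state: dict):
        key = f"trip:{trip_id}:{version}"
        self._replication.replicate(driver_id, key, state)  # cheap, same-datacenter call
        self._datastore[key] = state                         # dispatching's own datastore
        return {"status": "ok"}                              # reply to the driver app
```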
The messaging service is something we built as part of this system. It maintains a bidirectional communication channel with all drivers on the Uber platform, and that channel is actually separate from the original request-response channel we've traditionally been using at Uber for drivers to communicate with the server. This way, we are not affecting any normal business operation with this service. The messaging service then sends the message to the phone and gets an acknowledgement back.

So with this design, what we have achieved is that we've isolated the applications from any replication latencies or failures, because our replication service returns immediately, and the only extra thing an application does by opting in to this replication strategy is make an extra service call to the replication service, which is pretty cheap since it stays within the same datacenter rather than traveling over the internet. Secondly, having this separate channel gives us the ability to arbitrarily query the state of the phone without affecting any normal business operations, and we can use that phone as a basic key-value store now.
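On the phone side, the real implementations live in the iOS and Android apps; this Python sketch with invented message shapes only shows the shape of the protocol: the driver app stores whatever opaque blobs the backend sends and answers queries over that same channel.

```python
# Illustrative phone-side handler; reuses SetOnceStore from the key-value sketch above.
class DriverPhoneReplica:
    def __init__(self):
        self._store = SetOnceStore()

    def on_message(self, msg: dict) -> dict:
        """Handle one message from the messaging service and return an ack/response."""
        op = msg["op"]
        if op == "set":
            self._store.set(msg["key"], msg["blob"])
            return {"ack": True}
        if op == "delete":
            self._store.delete(msg["key"])
            return {"ack": True}
        if op == "list_keys":              # used for failover resolution and health checks
            return {"keys": self._store.list_keys()}
        if op == "get":
            return {"blob": self._store.get(msg["key"])}
        return {"error": "unknown op"}
```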
Next... OK, so now comes the issue of moving between datacenters. As I said earlier, when we fail over, we are actually leaving state behind in that datacenter. So how do we deal with that stale state?

The first approach we tried was to do some manual cleanup. We wrote some cleanup scripts, and every time we failed over from our primary datacenter to our backup datacenter, somebody would run that script in the primary, and it would go to the datastores for the dispatching service and clean out all the state there. However, this approach had operational pain, because somebody had to run it. Moreover, we allowed the ability to fail over per city, so you can actually choose to fail over specific cities instead of the whole world, and in those cases the script started becoming complicated.

So then we decided to tweak our design a little bit to solve this problem. The first thing we did was... as Josh mentioned earlier, the key which is stored on the phone contains the trip identifier and the version within it. The version used to be an incrementing number, so that we can keep track of any forward progress you're making. However, we changed that to a modified vector clock. Using that vector clock, we can now compare data on the phone and data on the server, and if there is a mismatch, we can detect any causality [inaudible] using that. And we can also resolve it using a very basic conflict resolution strategy. So this way, we handle any issues with ongoing trips.

Next came the issue of completed trips. Traditionally, what we'd been doing is, when a trip is completed, we delete all the data about the trip from the phone. We did that because we didn't want the replication data on the phone to grow unbounded, and once a trip is completed, it's probably no longer required for restoration. However, that has the side effect that mobile now has no idea that the trip ever happened. So what will happen is, if we fail back to a datacenter with some stale data about this trip, then you might actually end up putting the driver back on that same trip, which is a pretty bad experience, because he's suddenly driving somebody he already dropped off, and he's probably not going to get paid for that.

So what we did to fix that was, on trip completion, we store a special key on the phone, and the version in that key has a flag in it; that's why I call it a modified vector clock. It has a flag that says this trip has already been completed, and we store that on the phone. Now when the replication service sees that this driver has this flag for the trip, it can tell the dispatching service, "Hey, this trip has already been completed, and you should probably delete it." So that way, we handle completed trips.

And if you think about it, storing trip data is kind of expensive, because it's this huge encrypted blob of JSON, maybe. But we can store a lot of completed trips, because there is no trip data associated with them, just the key. So we can probably store weeks' worth of completed trips in the same amount of memory as we would store one trip's data. So that's how we solve stale state.
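The talk doesn't spell out the format of that "modified vector clock", so the following is only an illustrative guess at the resolution logic: a per-trip version carries a counter plus a completed flag, and the completed flag tells a stale datacenter to delete the trip rather than restore it.

```python
# Illustrative sketch only -- the exact format of the "modified vector clock" is not
# described in the talk. Here a version is (counter, completed): higher counters
# supersede lower ones, and a completed flag supersedes everything.
from typing import Optional, Tuple

Version = Tuple[int, bool]   # (monotonic counter, trip_completed)

def parse_key(key: str) -> Tuple[str, Version]:
    """Assume keys look like 'trip:<trip_id>:<counter>:<completed 0/1>'."""
    _, trip_id, counter, completed = key.split(":")
    return trip_id, (int(counter), completed == "1")

def resolve(phone_version: Optional[Version], server_version: Optional[Version]) -> str:
    """Decide what a (possibly stale) datacenter should do for one trip."""
    if phone_version is None:
        return "keep_server_state"      # nothing replicated yet; trust the server
    if phone_version[1]:                # completed flag set on the phone
        return "delete_trip"            # the trip already ended; don't resurrect it
    if server_version is None or phone_version[0] > server_version[0]:
        return "restore_from_phone"     # phone has newer state; rehydrate from it
    return "keep_server_state"
```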
So now comes the issue of ensuring four nines of reliability. We decided to exercise the system more often than an actual datacenter failure, because we wanted to be confident that the system actually works. Our first approach was to do manual failovers. Basically, a bunch of us would gather in a room every Monday, pick a few cities, and fail them over. And after we failed them over to one of the datacenters, we'd see what the success rate for the restoration was, and if there were any failures, we'd try to look at the logs and debug any issues there.

However, there were several problems with this approach. First, it was very operationally painful; we had to do this every week, and for the small fraction of trips which did not get restored, we would actually have to do fare adjustments for both the rider and the driver. Secondly, it led to a very poor customer experience, because that same fraction of people were suddenly bumped off their trip and got totally confused about what happened to them. Thirdly, it had low coverage, because we were covering only a few cities. But in the past we've seen problems which affected only a specific city, maybe because there was a new feature enabled in that city which was not [inaudible] yet. So this approach does not help us catch those cases until it's too late.

Finally, we had no idea whether the backup datacenter could handle the load. In our current architecture, we have a primary datacenter which handles all the requests, and then a backup datacenter which is waiting to handle all those requests in case the primary goes down. But how do we know that the backup datacenter can handle all those requests? One way is maybe to provision the same number of boxes and the same type of hardware in the backup datacenter. But what if there's a configuration issue in some of the services? We would never catch that. And even if they're exactly the same, how do you know that each service in the backup datacenter can handle the sudden flood of requests which comes when there is a failure? So we needed some way to fix all these problems.

So then, to understand how to get good confidence in the system and measure it well, we looked at the key things we really needed the system to do. The first thing was to ensure that all mutations which are done by the dispatching service actually end up stored on the phone. For example, a driver, right after he picks up a passenger, may lose connectivity.
And so the replication data may not be sent to the phone immediately, but we want to ensure that the data eventually makes it to the phone. Secondly, we wanted to make sure that the stored data can actually be used for restoration. For example, there may be some encryption or decryption issue with the data, and the data gets corrupted and is no longer usable. So even if you're storing the data, you cannot use it, so there's no point. Also, restoration actually involves rehydrating the state within the dispatching service using that data. So even if the data is fine, if there's any problem during that rehydration process, some service behaving [inaudible], you would still have no use for that data, and you would still lose the trip, even though the data is perfectly fine. Finally, as I mentioned earlier, we needed a way to figure out whether the backup datacenters can handle the load.

So to monitor the health of the system better, we wrote another service. Every hour, it gets a list of all active drivers and trips from our dispatching service, and for all those drivers, it uses that messaging channel to ask for their replication data. Once it has the replication data, it compares it with the data which the application expects. From that, we get a lot of good metrics, like what percentage of drivers have data successfully stored on them, and we can even break the metrics down by region or by app version. So this really helped us drill into the problem.
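Here is a minimal sketch of that hourly comparison; the service interfaces and metric names are assumptions, and it reuses the hypothetical list_keys query from the phone-side sketch above.

```python
# Illustrative hourly health check: for every active driver, ask the phone which trip
# keys it holds and compare that against what the dispatching service expects.
from collections import Counter

def run_health_check(dispatching_service, messaging_service) -> Counter:
    snapshot = dispatching_service.active_drivers_and_trips()   # {driver_id: expected_keys}
    results = Counter(total_drivers=len(snapshot))
    for driver_id, expected_keys in snapshot.items():
        reply = messaging_service.request(driver_id, {"op": "list_keys"})
        stored = set(reply.get("keys", []))
        if set(expected_keys) <= stored:
            results["drivers_fully_replicated"] += 1
        else:
            results["drivers_missing_data"] += 1   # break down further by region / app version
    return results
```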
Finally, to know whether the stored data can actually be used for restoration, and whether the backup datacenter can handle the load, what we do is take all the data which we got in the previous step and send it to our backup datacenter. Within the backup datacenter we perform what we call a shadow restoration. And since there is nobody else making changes in that backup datacenter, after the restoration is completed we can just query the dispatching service in the backup datacenter and say, "Hey, how many active riders, drivers, and trips do you have?" And we can compare those numbers with the numbers we got in our snapshot from the primary datacenter.

Using that, we get really valuable information about our success rate, and we can do similar breakdowns by different parameters like region or app version. Finally, we also get metrics around how well the backup datacenter did. Did we subject it to a lot of load? Can it handle the traffic when there is a real failure? Also, any configuration issue in the backup datacenter can easily be caught by this approach. So using this service, we are constantly testing the system and making sure we have confidence in it and can use it during a failure. Because if there's no confidence in the system, then it's pointless.

So yeah, that was the idea behind the system and how we implemented it. I did not get a chance to go into different [inaudible] of detail, but if you guys have any questions, you can always reach out to us during the office hours. So thanks, guys, for coming and listening to us.

(audience claps)