0:00:24.900,0:00:25.410 0:00:25.410,0:00:26.650 Chad: Yes, hello, thank you. 0:00:26.650,0:00:29.140 Audience member: Hello! 0:00:29.140,0:00:30.810 Chad: Hello! 0:00:30.810,0:00:33.629 I am Chad, as he said. 0:00:33.629,0:00:35.300 He said I need no introduction 0:00:35.300,0:00:37.890 so I won't introduce myself any further. 0:00:37.890,0:00:44.890 I may be the biggest non-Indian fan of India 0:01:01.690,0:01:06.400 [Hindi speech] 0:01:06.400,0:01:13.400 0:01:15.390,0:01:17.830 I'll now switch back, sorry. 0:01:17.830,0:01:20.010 If you don't understand Hindi, I said nothing[br]of value 0:01:20.010,0:01:22.030 and it was all wrong. 0:01:22.030,0:01:23.549 But I was saying that my Hindi is bad 0:01:23.549,0:01:26.000 and it's because now I'm learning German 0:01:26.000,0:01:28.110 so I mixed them together, but I know not everyone 0:01:28.110,0:01:29.440 speaks Hindi here. 0:01:29.440,0:01:32.480 I just had to show off, you know. 0:01:32.480,0:01:37.110 So, I am currently working at 6WunderKinder, 0:01:37.110,0:01:40.370 and I'm working on a product called Wunderlist. 0:01:40.370,0:01:42.060 It is a productivity application. 0:01:42.060,0:01:45.860 It runs on every client you can think of. 0:01:45.860,0:01:47.620 We have native clients, we have a back-end, 0:01:47.620,0:01:49.690 we have millions of active users, 0:01:49.690,0:01:51.850 and I'm telling you this not so that you'll[br]go download it - 0:01:51.850,0:01:53.390 you can do that too - 0:01:53.390,0:01:56.960 but I want to tell you about the challenges[br]that I have 0:01:56.960,0:02:00.710 and the way I'm starting to think about systems[br]architecture and design. 0:02:00.710,0:02:03.130 That's what I'm gonna talk about today. 0:02:03.130,0:02:05.690 I'm going to show you some things that are[br]real 0:02:05.690,0:02:07.159 and that we're really doing. 0:02:07.159,0:02:09.429 I'm going to show you some things that are 0:02:09.429,0:02:12.670 just a fantasy that maybe don't make any sense[br]at all. 
0:02:12.670,0:02:13.980 But hopefully I'll get you to think about 0:02:13.980,0:02:15.540 how we think about system architecture 0:02:15.540,0:02:18.370 and how we build things that can last for[br]a long time. 0:02:18.370,0:02:20.870 So the first thing that I want to mention: 0:02:20.870,0:02:23.340 this is a graph from the Standish Chaos report 0:02:23.340,0:02:25.430 and I've taken the years out 0:02:25.430,0:02:27.310 and I've taken some of the raw data out 0:02:27.310,0:02:28.720 because it doesn't matter. 0:02:28.720,0:02:30.530 If you look at these, this graph, 0:02:30.530,0:02:33.379 each one of these bars is a year, 0:02:33.379,0:02:38.159 and each bar represents successful projects[br]in green - 0:02:38.159,0:02:40.079 software projects. 0:02:40.079,0:02:42.409 Challenged projects are in silver or white[br]in the middle 0:02:42.409,0:02:44.249 and then failed ones are in red. 0:02:44.249,0:02:47.340 But challenged means significantly over time[br]or budget 0:02:47.340,0:02:49.349 which to me means failed too. 0:02:49.349,0:02:51.430 So basically we're terrible, 0:02:51.430,0:02:54.279 all of us here, we're terrible. 0:02:54.279,0:02:57.060 We call ourselves engineers but it's a disgrace. 0:02:57.060,0:03:00.840 We very rarely actually launch things that[br]work. 0:03:00.840,0:03:01.389 Kind of sad, 0:03:01.389,0:03:03.829 and I am here to bring you down. 0:03:03.829,0:03:07.169 Then once you launch software, anecdotally, 0:03:07.169,0:03:12.359 and you probably would see this in your own[br]work lives, too, 0:03:12.359,0:03:16.230 anecdotally, software gets killed after about[br]five years - 0:03:16.230,0:03:17.650 business software. 
0:03:17.650,0:03:19.950 So you barely ever get to launch it, because, 0:03:19.950,0:03:23.319 or at least successfully, in a way that you're[br]proud of, 0:03:23.319,0:03:24.650 and then in about five years 0:03:24.650,0:03:27.739 you end up in that situation where you're[br]doing a big rewrite 0:03:27.739,0:03:29.519 and throwing everything away and replacing[br]it. 0:03:29.519,0:03:32.519 You know there's always that project to get[br]rid of the junk, 0:03:32.519,0:03:35.569 old Java code or whatever that you wrote five[br]years ago, 0:03:35.569,0:03:37.180 replace it with Ruby now, 0:03:37.180,0:03:39.909 five years from now you'll be replacing your[br]old junk Ruby code 0:03:39.909,0:03:46.180 that didn't work with something else. 0:03:46.180,0:03:49.379 We create this thing, probably all of you[br]know the term legacy software - 0:03:49.379,0:03:53.340 Right, am I right? You know what legacy software[br]is, 0:03:53.340,0:03:56.139 and you probably think of it as a negative[br]thing. 0:03:56.139,0:03:58.120 You think of it as that ugly code that doesn't[br]work, 0:03:58.120,0:04:02.540 that's brittle, that you can't change, that[br]you're all afraid of. 0:04:02.540,0:04:07.150 But there's actually also a positive connotation[br]of the word legacy: 0:04:07.150,0:04:14.139 it's leaving behind something that future[br]generations can benefit from. 0:04:14.139,0:04:17.370 But if we're rarely ever launching successful[br]projects 0:04:17.370,0:04:20.889 and then the ones we do launch tend to die[br]within five years 0:04:20.889,0:04:24.600 none of us are actually creating a legacy[br]in our work. 0:04:24.600,0:04:27.430 We're just creating stuff that gets thrown[br]away. 0:04:27.430,0:04:29.400 Kind of sad. 0:04:29.400,0:04:32.240 So we create this stuff that's a legacy software. 
0:04:32.240,0:04:35.060 It's hard to change, that's why it ends up[br]getting thrown away 0:04:35.060,0:04:37.370 right, that's, if the software worked 0:04:37.370,0:04:40.030 and you could keep changing it to meet the[br]needs of the business 0:04:40.030,0:04:43.979 you wouldn't need to do a big rewrite and[br]throw it away. 0:04:43.979,0:04:47.840 We create these huge tightly-coupled systems, 0:04:47.840,0:04:49.370 and I don't just mean one application, 0:04:49.370,0:04:51.430 but like many applications are all tightly[br]coupled. 0:04:51.430,0:04:55.900 You've got this thing over here talking to[br]the database of this system over here 0:04:55.900,0:04:59.360 so if you change the columns to update the[br]view of a webpage 0:04:59.360,0:05:02.710 you ruin your billing system, that kind of[br]thing 0:05:02.710,0:05:06.270 this is what makes it so hard to change 0:05:06.270,0:05:09.970 and the sad thing about this is the way we[br]work 0:05:09.970,0:05:13.500 the way we develop software, this is the default[br]setting 0:05:13.500,0:05:18.460 and, what I mean is, if we were robots churning[br]out software 0:05:18.460,0:05:20.819 and we had a preferences panel 0:05:20.819,0:05:25.080 the default preferences would lead to us creating[br]terrible software that gets thrown away in 0:05:25.080,0:05:25.699 five years 0:05:25.699,0:05:27.210 that's just how we all work 0:05:27.210,0:05:30.180 as human beings when we sit down to write[br]code 0:05:30.180,0:05:35.430 our default instincts lead to us to create[br]systems that are tightly coupled 0:05:35.430,0:05:41.659 and hard to change and ultimately get thrown[br]away and can't scale 0:05:41.659,0:05:46.060 we create, we try doing tests, we try doing[br]TDD 0:05:46.060,0:05:51.330 but we create test suites that take forty-five[br]minutes to run 0:05:51.330,0:05:52.720 every team has had to deal with this I'm sure 0:05:52.720,0:05:55.990 if you've written any kind of meaningful application 0:05:55.990,0:05:57.970 and it 
gets to where you have like a project 0:05:57.970,0:05:59.849 to speed up the test suite 0:05:59.849,0:06:02.949 like you start focusing your company's resources 0:06:02.949,0:06:04.949 on making the test suite faster 0:06:04.949,0:06:08.689 or making it like only fail ninety percent[br]of the time 0:06:08.689,0:06:10.949 and then you say well if it only fails ninety[br]percent that's OK 0:06:10.949,0:06:14.550 right, and right now it's taking forty-five[br]minutes 0:06:14.550,0:06:18.180 we want to get it to where it only takes ten[br]minutes to run 0:06:18.180,0:06:24.479 so the test suite ends up being a liability[br]instead of a benefit 0:06:24.479,0:06:25.719 because of the way you do it 0:06:25.719,0:06:29.319 because you have this architecture where everything[br]is so coupled 0:06:29.319,0:06:34.939 you can't change anything without spending[br]hours working on the stupid test suite 0:06:34.939,0:06:38.419 and you're terrified to deploy 0:06:38.419,0:06:42.960 I know like the last big Java project I was[br]working on 0:06:42.960,0:06:45.819 it would take, once a week we did a deploy 0:06:45.819,0:06:50.139 it would take fifteen people all night to[br]deploy the thing 0:06:50.139,0:06:52.460 and usually it was like copying class files[br]around 0:06:52.460,0:06:54.430 and restarting servers 0:06:54.430,0:06:57.120 it's much better today but it's still terrifying 0:06:57.120,0:06:59.340 you deploy code, you change it in production 0:06:59.340,0:07:01.259 you're not sure what might break 0:07:01.259,0:07:03.719 'cause it's really hard to test these big integrated[br]things together 0:07:03.719,0:07:08.650 and actually upgrading the technology component[br]is terrifying 0:07:08.650,0:07:13.289 so, how many of you have been doing Rails[br]for more than three years? 0:07:13.289,0:07:18.400 do you have, like a Rails 2 app in production,[br]anyone? Yeah? 
0:07:18.400,0:07:21.710 that's a lot of people, wow, that's terrifying 0:07:21.710,0:07:26.129 and I've been in situations, recently, where[br]we had Rails 2 apps in production 0:07:26.129,0:07:29.560 security patches are coming out, we were applying[br]our own versions 0:07:29.560,0:07:30.879 of those security patches 0:07:30.879,0:07:32.439 because we were afraid to upgrade Rails 0:07:32.439,0:07:35.060 we would rather hack it than upgrade the thing 0:07:35.060,0:07:38.319 because you just don't know what's gonna happen 0:07:38.319,0:07:42.490 and then you end up, as you're re-implementing[br]all this stuff yourself 0:07:42.490,0:07:44.819 you end up burning yourself out, wasting your[br]time 0:07:44.819,0:07:47.990 because you're hacking on stupid Rails 2 0:07:47.990,0:07:50.139 or some old Struts version 0:07:50.139,0:07:52.569 when you should be just taking advantage of[br]the new patches 0:07:52.569,0:07:54.639 but you can't because you're afraid to upgrade[br]the software 0:07:54.639,0:07:56.240 because you don't know what's going to happen 0:07:56.240,0:08:02.849 because the system is too big and too scary 0:08:02.849,0:08:04.949 then, and this is really bad, I think this[br]is something 0:08:04.949,0:08:07.009 Ruby messes up for all of us 0:08:07.009,0:08:11.110 I say this as someone who's been using Ruby[br]for thirteen years now 0:08:11.110,0:08:12.740 happily 0:08:12.740,0:08:15.520 we create these mountains of abstractions 0:08:15.520,0:08:17.669 and the logic ends up being buried inside[br]them 0:08:17.669,0:08:22.860 I mean in Java it was like static, or, you[br]know, factories 0:08:22.860,0:08:25.090 and design pattern soup 0:08:25.090,0:08:27.490 in Ruby it's modules and mixins and you know 0:08:27.490,0:08:31.050 we have all these crazy ways of hiding what's[br]actually happening from us 0:08:31.050,0:08:33.070 but when you go look at the code 0:08:33.070,0:08:34.360 it's completely opaque 0:08:34.360,0:08:37.090 you have no idea where the stuff 
actually[br]gets done 0:08:37.090,0:08:40.820 because it's in some magic library somewhere 0:08:40.820,0:08:45.050 and we do all that because we're trying to[br]save ourselves from the complexity of these 0:08:45.050,0:08:47.450 big nasty systems 0:08:47.450,0:08:50.760 but like if you look at the rest of the world 0:08:50.760,0:08:53.810 this is a software specific problem 0:08:53.810,0:08:58.760 these cars are old, they're older than any[br]software that you would ever run 0:08:58.760,0:09:00.340 and they're still driving down the street 0:09:00.340,0:09:03.460 they're older than software itself, right 0:09:03.460,0:09:06.370 but these things still function, they still[br]work 0:09:06.370,0:09:08.970 how? why? why do they work? 0:09:08.970,0:09:11.340 bodies! my body should not work 0:09:11.340,0:09:12.540 I have abused it 0:09:12.540,0:09:13.870 I should not be standing here today 0:09:13.870,0:09:16.660 I shouldn't have been able to come from Berlin[br]here 0:09:16.660,0:09:18.620 without dying somehow by being in the air 0:09:18.620,0:09:23.660 you know, by the air pressure changes 0:09:23.660,0:09:25.950 but our bodies somehow can survive even when 0:09:25.950,0:09:30.730 we don't take care of them 0:09:30.730,0:09:35.290 and like it's just the system that works,[br]right 0:09:35.290,0:09:37.770 so how do our bodies work? 
0:09:37.770,0:09:39.440 how do we stay alive 0:09:39.440,0:09:40.930 despite this fact 0:09:40.930,0:09:42.130 even though we haven't done like some 0:09:42.130,0:09:45.270 great design, we don't have any design patterns 0:09:45.270,0:09:49.780 like mixed up into our bodies 0:09:49.780,0:09:53.980 in biology there is a term called homeostasis 0:09:53.980,0:09:56.210 and I literally don't know what this means 0:09:56.210,0:09:57.390 other than this definition 0:09:57.390,0:09:58.870 so you won't learn about this from me 0:09:58.870,0:10:01.060 there's probably at least one biologist in[br]the room 0:10:01.060,0:10:04.370 so you can correct me later 0:10:04.370,0:10:07.870 but basically the idea of homeostasis is 0:10:07.870,0:10:11.430 that an organism has all these different components 0:10:11.430,0:10:13.890 that serve different purposes 0:10:13.890,0:10:15.820 that regulate it 0:10:15.820,0:10:18.260 so they're all kind of in balance 0:10:18.260,0:10:20.750 and they work together to regulate the system 0:10:20.750,0:10:23.700 if one component, like a liver, does too much 0:10:23.700,0:10:24.720 or does the wrong thing 0:10:24.720,0:10:27.840 another component kicks in and fixes it 0:10:27.840,0:10:30.160 and so our bodies are this well designed system 0:10:30.160,0:10:31.960 for staying alive 0:10:31.960,0:10:34.530 because we have almost like autonomous agents 0:10:34.530,0:10:38.810 internally that take care of the many things[br]that can and do go wrong 0:10:38.810,0:10:41.890 on a regular basis 0:10:41.890,0:10:43.770 so you have, you know, your brain, your liver 0:10:43.770,0:10:47.230 your liver, of course, metabolizes toxic substances 0:10:47.230,0:10:50.400 your kidney deals with blood, water level,[br]et cetera 0:10:50.400,0:10:55.660 you know all these things work in concert[br]to make you live 0:10:55.660,0:11:01.140 the inability to continue to do that is known[br]as homeostatic imbalance 0:11:01.140,0:11:04.070 so I was saying, homeostasis is 
balancing 0:11:04.070,0:11:07.330 not being able to do that is when you're out[br]of balance 0:11:07.330,0:11:10.340 and that will actually lead to really bad[br]health problems 0:11:10.340,0:11:16.410 or probably death, if you fall into homeostatic[br]imbalance 0:11:16.410,0:11:20.420 so the good news is you're already dying 0:11:20.420,0:11:22.450 like we're all dying all the time 0:11:22.450,0:11:26.500 this is the beautiful thing about death 0:11:26.500,0:11:29.110 there is, there is an estimate that fifty[br]trillion cells 0:11:29.110,0:11:31.850 are in your body, and three million die per[br]second 0:11:31.850,0:11:35.520 it's an estimate because it's actually impossible[br]to count 0:11:35.520,0:11:39.520 but scientists have figured out somehow that[br]this is probably the right number 0:11:39.520,0:11:42.310 so your cells, you've probably heard this[br]all your life 0:11:42.310,0:11:45.170 like physically, after some amount of time, 0:11:45.170,0:11:47.430 you aren't the same human being that you were,[br]physically 0:11:47.430,0:11:52.770 you know, I don't know, you some period of[br]time ago 0:11:52.770,0:11:55.500 you're literally not the same organism anymore 0:11:55.500,0:11:58.420 but you're the same system 0:11:58.420,0:12:01.470 kind of interesting, isn't it 0:12:01.470,0:12:06.740 so in a way you can think about software this 0:12:06.740,0:12:08.300 you can think about software as a system 0:12:08.300,0:12:10.820 if the components could be replaced like these[br]cells 0:12:10.820,0:12:17.820 like, if you focus on making death, constant[br]death OK 0:12:18.970,0:12:19.890 on a small level 0:12:19.890,0:12:24.690 then the system can live on a large level 0:12:24.690,0:12:25.760 that's what this talk is about 0:12:25.760,0:12:29.300 solution, the solution being to mimic living[br]organisms 0:12:29.300,0:12:36.110 and as an aside, I will say many times the[br]word small or tiny in this talk 0:12:36.110,0:12:38.480 because I think I'm learning, as I age 
0:12:38.480,0:12:39.870 that small is good 0:12:39.870,0:12:42.950 it's, small projects are good 0:12:42.950,0:12:44.050 you know how to estimate them 0:12:44.050,0:12:45.110 small commitments are good 0:12:45.110,0:12:46.790 because you know you can make them 0:12:46.790,0:12:47.750 small methods are good 0:12:47.750,0:12:48.790 small classes are good 0:12:48.790,0:12:50.140 small applications are good 0:12:50.140,0:12:52.410 small teams are good 0:12:52.410,0:12:55.270 so I don't know, this is sort of a non sequitur 0:12:55.270,0:12:58.130 so if we're going to think about software 0:12:58.130,0:12:59.750 as like an organism 0:12:59.750,0:13:03.100 what is a cell in that context? 0:13:03.100,0:13:06.360 this is sort of the key question that you[br]have to ask yourself 0:13:06.360,0:13:08.940 and I say that a cell is a tiny component 0:13:08.940,0:13:12.800 now, tiny and component are both subjective[br]words 0:13:12.800,0:13:15.370 so you can kind of do what you want with that 0:13:15.370,0:13:17.670 but it's a good frame of thinking 0:13:17.670,0:13:20.530 if you make your software system of tiny components 0:13:20.530,0:13:22.510 each one can be like a cell 0:13:22.510,0:13:28.010 each one can die and the system is a collection[br]of those tiny components 0:13:28.010,0:13:31.930 and what you want is not for your code to[br]live forever 0:13:31.930,0:13:35.700 you don't care that each line of code lives[br]forever, right 0:13:35.700,0:13:38.830 like if you're trying to develop a legacy[br]in software 0:13:38.830,0:13:42.920 it's not important to you that your System.out.println[br]statement 0:13:42.920,0:13:44.300 lives for ten years 0:13:44.300,0:13:48.050 it's important to you that the function of[br]the system lives for ten years 0:13:48.050,0:13:50.170 so like, about exactly ten years ago 0:13:50.170,0:13:57.170 we created RubyGems at the RubyConf 2003[br]in Austin, Texas 0:13:59.260,0:14:03.600 I haven't touched RubyGems myself in like[br]four or 
five years 0:14:03.600,0:14:04.890 but people are still using it 0:14:04.890,0:14:06.130 they hate it because it's software 0:14:06.130,0:14:07.750 everybody hates software right 0:14:07.750,0:14:10.160 so if you can create software that people[br]hate 0:14:10.160,0:14:13.080 you've succeeded 0:14:13.080,0:14:14.450 but it still exists 0:14:14.450,0:14:16.560 I have no idea if any of the code is the same 0:14:16.560,0:14:17.210 I would assume not 0:14:17.210,0:14:21.350 you know I think, I'm sure that my name is[br]still in it in a copyright notice 0:14:21.350,0:14:23.510 but that's about it 0:14:23.510,0:14:24.890 and that's a beautiful thing 0:14:24.890,0:14:28.380 people are still using it to install Ruby[br]libraries 0:14:28.380,0:14:29.570 and software 0:14:29.570,0:14:35.600 and I don't care if any of my existing, or[br]my initial code is still in the system 0:14:35.600,0:14:36.840 because the system still lives 0:14:36.840,0:14:43.030 so, quite a long time ago now I was researching[br]this kind of question 0:14:43.030,0:14:44.600 about Legacy software 0:14:44.600,0:14:48.390 and I asked a question on Twitter as I often[br]do at conferences 0:14:48.390,0:14:49.910 when I'm preparing 0:14:49.910,0:14:55.610 what are some of the old surviving software[br]systems you regularly use 0:14:55.610,0:14:58.430 and if you look at this, I mean, one thing[br]is obviously 0:14:58.430,0:15:03.290 everyone who answered gave some sort of Unix[br]related answer 0:15:03.290,0:15:06.510 but basically all of these things on this[br]list 0:15:06.510,0:15:13.240 are either systems that are collections of[br]really well-known split-up components 0:15:13.240,0:15:15.700 or they're tiny, tiny programs 0:15:15.700,0:15:18.540 so, like, grep is a tiny program, make 0:15:18.540,0:15:19.640 it only does one thing 0:15:19.640,0:15:23.970 well make is actually also arguably an operating[br]system 0:15:23.970,0:15:27.320 but I won't get into that 0:15:27.320,0:15:29.390 emacs is obviously 
an operating system, right 0:15:29.390,0:15:33.050 but it's well designed of these tiny little[br]pieces 0:15:33.050,0:15:37.190 so a lot of the old systems I know about follow[br]this pattern 0:15:37.190,0:15:40.190 this metaphor that I'm proposing 0:15:40.190,0:15:42.170 and from my own career 0:15:42.170,0:15:43.530 when I was here before in Bangalore 0:15:43.530,0:15:47.250 I worked for GE and some of the people 0:15:47.250,0:15:48.690 we hired even worked on the system there 0:15:48.690,0:15:50.970 we had a system called the Bull 0:15:50.970,0:15:53.700 and it was a Honeywell Bull mainframe 0:15:53.700,0:15:57.280 I doubt any of you have worked on that 0:15:57.280,0:15:58.440 but this one I know you didn't work on 0:15:58.440,0:16:01.070 because it had a custom operating system 0:16:01.070,0:16:03.110 with our own RDBMS 0:16:03.110,0:16:06.260 we had created a TCP stack for it 0:16:06.260,0:16:11.160 using like custom hardware that we plugged[br]into a Windows NT computer 0:16:11.160,0:16:14.930 with some sort of NT queuing system back in[br]the day 0:16:14.930,0:16:17.060 it was this terrifying thing 0:16:17.060,0:16:22.510 when I started working there the system was[br]already something like twenty-five years old 0:16:22.510,0:16:25.630 and I believe even though there have been[br]many, many projects 0:16:25.630,0:16:30.160 to try to kill it, like we had a team called[br]the Bull exit team 0:16:30.160,0:16:33.230 I believe the system is still in production 0:16:33.230,0:16:37.070 not as much as it used to be, there are less[br]and less functions in production 0:16:37.070,0:16:39.190 but I believe the system is still in production 0:16:39.190,0:16:46.190 the reason for this is that the system was[br]actually made up of these tiny little components 0:16:47.070,0:16:50.540 and like really clear interfaces between them 0:16:50.540,0:16:53.950 and we kept the system live because every[br]time we tried to replace it 0:16:53.950,0:16:57.290 with some fancy new gem, 
web thing or GUI[br]app 0:16:57.290,0:16:59.470 it wasn't as good, and the users hated it 0:16:59.470,0:17:00.740 it just didn't work 0:17:00.740,0:17:04.789 so we had to use this old, crazy, modified[br]mainframe 0:17:04.789,0:17:08.150 for a long time as a result 0:17:08.150,0:17:10.890 so, the question I ask myself is now 0:17:10.890,0:17:13.429 how do I, how do I approach a problem like[br]this 0:17:13.429,0:17:19.000 and build a system that can survive for a[br]long time 0:17:19.000,0:17:20.049 I would encourage you 0:17:20.049,0:17:22.589 how many of you know of Fred George 0:17:22.589,0:17:24.720 this is Fred George 0:17:24.720,0:17:25.900 he was at ThoughtWorks for a while 0:17:25.900,0:17:27.669 so he may have, I think he lived in Bangalore 0:17:27.669,0:17:31.150 for some time with ThoughtWorks, in fact 0:17:31.150,0:17:35.050 he is now running a start-up in Silicon Valley 0:17:35.050,0:17:38.600 but he has this talk that you can watch online 0:17:38.600,0:17:41.660 from the Barcelona Ruby Conference the year[br]before last 0:17:41.660,0:17:45.030 called Microservice Architectures 0:17:45.030,0:17:47.760 and he talks in great detail about he, 0:17:47.760,0:17:50.340 how he implemented a concept at Forward 0:17:50.340,0:17:52.150 that's very much like what I'm talking about 0:17:52.150,0:17:55.130 tiny components that only do one thing and[br]can be thrown away 0:17:55.130,0:17:59.890 so Microservice Architecture is kind of the[br]core of what I'm gonna talk about 0:17:59.890,0:18:02.080 now I've put together some rules for 6WunderKinder 0:18:02.080,0:18:04.110 which I am going to share with you 0:18:04.110,0:18:07.110 6WunderKinder is the company I work for 0:18:07.110,0:18:09.220 when we're working on Wunderlist 0:18:09.220,0:18:11.940 and the rules of the, the goals of these rules 0:18:11.940,0:18:16.690 are to reduce coupling, to make it where we[br]can do fear-free deployments 0:18:16.690,0:18:19.190 we reduce the chance of "cruft" in our code 
0:18:19.190,0:18:20.680 like nasty stuff that you're afraid of 0:18:20.680,0:18:24.670 that you leave there, kind of broken window[br]problems 0:18:24.670,0:18:28.660 we make it literally trivial to change code 0:18:28.660,0:18:32.680 so you just never have to ask how do I do[br]that 0:18:32.680,0:18:33.990 you just find it easy 0:18:33.990,0:18:39.140 and most importantly we give ourselves the[br]freedom to go fast 0:18:39.140,0:18:43.680 because I think no developer ever wants to[br]be slow 0:18:43.680,0:18:44.670 that's one of the worst things 0:18:44.670,0:18:47.750 just toiling away and not actually accomplishing[br]anything 0:18:47.750,0:18:50.920 but we go slow because we're constrained by[br]the system 0:18:50.920,0:18:54.110 and we're constrained by, sometimes projects 0:18:54.110,0:18:56.010 and other, you know, management related things 0:18:56.010,0:19:01.230 but often times its the mess of the system[br]that we've created 0:19:01.230,0:19:03.730 so some of the rules 0:19:03.730,0:19:09.480 I think one thing, and maybe, maybe I'm going[br]to get some push back from this crowd 0:19:09.480,0:19:13.170 one rule that is less controversial than it[br]used to be 0:19:13.170,0:19:15.050 is that comments are a design smell 0:19:15.050,0:19:19.270 does anyone strongly disagree with that? 0:19:19.270,0:19:20.750 no? 0:19:20.750,0:19:23.930 does anyone strongly agree with that? 0:19:23.930,0:19:27.210 OK, so the rest of you have no idea what I'm[br]talking about 0:19:27.210,0:19:33.240 so a design smell, I want to define this really[br]quickly 0:19:33.240,0:19:36.860 a design smell is something you see in your[br]code or your system 0:19:36.860,0:19:39.600 where it doesn't necessarily mean it's bad 0:19:39.600,0:19:40.730 but you look at it and you think 0:19:40.730,0:19:43.270 hmm, I should look into this a little bit 0:19:43.270,0:19:45.930 and ask myself, why are there so many comments[br]in this code? 
0:19:45.930,0:19:48.300 you know, especially the bottom one 0:19:48.300,0:19:50.650 inline comments? 0:19:50.650,0:19:56.810 definitely bad, definitely a sign that you[br]should have another method, right 0:19:56.810,0:19:59.040 so it's pretty easy to convince people 0:19:59.040,0:20:00.430 that comments are a design smell 0:20:00.430,0:20:02.010 and I think a lot of people in the industry 0:20:02.010,0:20:03.490 are starting to agree 0:20:03.490,0:20:05.280 maybe not for like a public library 0:20:05.280,0:20:06.800 where you really need to tell someone 0:20:06.800,0:20:09.660 here's how you use this class and this is[br]what it's for 0:20:09.660,0:20:12.470 but you shouldn't have to document every method 0:20:12.470,0:20:15.440 and every argument because the method name[br]and the argument name 0:20:15.440,0:20:18.170 should speak for themselves, right 0:20:18.170,0:20:20.670 so here's one that you probably won't agree[br]with 0:20:20.670,0:20:21.940 tests are a design smell 0:20:21.940,0:20:28.580 so this one is probably a little more controversial 0:20:28.580,0:20:32.570 especially in an environment where you're[br]maybe still struggling people 0:20:32.570,0:20:37.910 struggling with people to actually get them[br]to write tests to begin with, right 0:20:37.910,0:20:41.330 you know I went through this period in, like,[br]2000 and 2001 0:20:41.330,0:20:44.090 where I was really heavily into evangelizing[br]TDD 0:20:44.090,0:20:47.190 and it was really stressful that you couldn't[br]get anyone to do it 0:20:47.190,0:20:49.520 I think you do have to go through that period 0:20:49.520,0:20:52.110 and I'm not saying you shouldn't write any[br]tests 0:20:52.110,0:20:57.180 but that picture I showed you earlier of the[br]slow, brittle test suite 0:20:57.180,0:20:58.300 that's bad, right 0:20:58.300,0:21:00.910 that is a bad state to be in 0:21:00.910,0:21:03.890 and you're in that state because your tests[br]suck 0:21:03.890,0:21:05.850 that's why you get in that 
state 0:21:05.850,0:21:09.570 your tests suck because you're writing bad[br]tests 0:21:09.570,0:21:15.910 that don't exercise the right things in your[br]system 0:21:15.910,0:21:18.960 and what I've found is whenever I look into[br]one of these 0:21:18.960,0:21:21.809 big slow brittle test suites 0:21:21.809,0:21:25.180 the tests themselves are indications 0:21:25.180,0:21:28.240 and the sheer proliferation of tests 0:21:28.240,0:21:30.940 are indications that the system is bad 0:21:30.940,0:21:33.590 and the developers are like desperately 0:21:33.590,0:21:36.980 fearfully trying to run the code 0:21:36.980,0:21:38.650 in every way they can 0:21:38.650,0:21:40.660 because it's the only way they can manage 0:21:40.660,0:21:43.980 to even think about the complexity 0:21:43.980,0:21:47.720 but if you think about it, if you had a tiny[br]trivial system 0:21:47.720,0:21:50.059 you wouldn't need to have hundreds of test[br]files 0:21:50.059,0:21:53.059 that take ten minutes to run, ever 0:21:53.059,0:21:54.480 if you did, you're doing something stupid 0:21:54.480,0:21:57.020 you're wasting your time working on tests 0:21:57.020,0:22:00.110 and we as software developers obsess about[br]this kind of thing 0:22:00.110,0:22:04.770 because we have to fight so hard to get our[br]peers to do it in the first place 0:22:04.770,0:22:06.370 and to understand it 0:22:06.370,0:22:10.050 we obsess to the point where we focus on the[br]wrong thing 0:22:10.050,0:22:14.660 none of us are in the business of writing[br]tests for customers 0:22:14.660,0:22:17.559 like we're not launching our tests on the[br]web 0:22:17.559,0:22:20.000 and hoping people will buy them, right 0:22:20.000,0:22:23.900 it doesn't provide value, it's just a side-effect 0:22:23.900,0:22:25.780 that we have focused too heavily on 0:22:25.780,0:22:29.930 and we've lost sight of what the actual goal[br]is 0:22:29.930,0:22:34.210 so, this one actually requires a visual 0:22:34.210,0:22:37.100 I tell the people on my 
team now 0:22:37.100,0:22:40.340 you can write code in any language you want 0:22:40.340,0:22:42.760 any framework you want, anything you want[br]to do 0:22:42.760,0:22:44.559 as long as the code is this big 0:22:44.559,0:22:47.490 so if you want to write the new service in[br]Haskell 0:22:47.490,0:22:50.059 and it's this big in a normal size font 0:22:50.059,0:22:51.470 you can do it 0:22:51.470,0:22:54.260 if you want to do it in Closure or Elixir[br]or Scarla or Ruby 0:22:54.260,0:22:55.050 or whatever you want to do 0:22:55.050,0:22:56.820 even Python for god's sake 0:22:56.820,0:22:59.230 you can do it if it's this big and no bigger 0:22:59.230,0:23:04.010 why? because it means I can look at it 0:23:04.010,0:23:05.620 and I can understand it 0:23:05.620,0:23:08.730 or if I don't I'll just throw it away 0:23:08.730,0:23:12.100 because if it's this big it doesn't do very[br]much, right 0:23:12.100,0:23:14.450 so the risk is really low 0:23:14.450,0:23:16.809 and I really mean the system is that 0:23:16.809,0:23:19.130 there are the, the component is that big 0:23:19.130,0:23:21.070 and in my world a component means a service 0:23:21.070,0:23:24.710 that's running and probably listening on an[br]HTTP board 0:23:24.710,0:23:27.820 or some sort of rift or RPC protocol 0:23:27.820,0:23:29.520 so it's a standalone thing 0:23:29.520,0:23:30.680 it's its own application 0:23:30.680,0:23:33.130 it's probably in its own git repository 0:23:33.130,0:23:34.950 people do poll requests against it 0:23:34.950,0:23:35.820 but it's just tiny 0:23:35.820,0:23:39.110 so this big 0:23:39.110,0:23:41.200 at the top of this, by the way 0:23:41.200,0:23:45.720 is some code by Konstantin Haase 0:23:45.720,0:23:48.720 who also lives in Berlin, where I live 0:23:48.720,0:23:51.480 this is a rewrite of Sinatra 0:23:51.480,0:23:52.430 the web framework 0:23:52.430,0:23:55.450 and Konstantin is actually the maintainer[br]of Sinatra 0:23:55.450,0:23:58.870 it's not fully compatible, but 
it's amazingly[br]close 0:23:58.870,0:24:00.260 and it all fits right in that 0:24:00.260,0:24:05.020 but the font size is kind of small, so I cheated 0:24:05.020,0:24:08.550 another rule, our systems are heterogeneous[br]by default 0:24:08.550,0:24:11.420 so I say you can write in any language you[br]want 0:24:11.420,0:24:14.050 that's not just because I want the developers[br]to be excited 0:24:14.050,0:24:16.650 although I think, most of you, if you worked 0:24:16.650,0:24:19.390 in an environment where your boss told you 0:24:19.390,0:24:21.500 you can use any programming language or tool[br]you want 0:24:21.500,0:24:23.809 you would be pretty happy about that, right 0:24:23.809,0:24:26.590 anyone unhappy about that? I don't think so 0:24:26.590,0:24:28.100 unless it's one of the bosses here 0:24:28.100,0:24:31.679 that's like don't tell people that 0:24:31.679,0:24:32.570 so that's one thing 0:24:32.570,0:24:36.880 the other one is, it leads to a good system[br]design 0:24:36.880,0:24:38.840 because think about this 0:24:38.840,0:24:42.350 if I write one program in Erlang, one component[br]in Erlang 0:24:42.350,0:24:44.410 one program in Ruby 0:24:44.410,0:24:47.710 I have to work really, really hard to make[br]tight coupling 0:24:47.710,0:24:49.650 between those things 0:24:49.650,0:24:53.340 like I have to basically use computer science[br]to do that 0:24:53.340,0:24:54.370 I don't even know what I would do 0:24:54.370,0:24:55.929 you know it's hard 0:24:55.929,0:24:58.590 like I would have to maybe implement Ruby[br]in Erlang 0:24:58.590,0:25:01.140 so that it can run in the same VM or vice[br]versa 0:25:01.140,0:25:04.059 it's just silly, I wouldn't do it 0:25:04.059,0:25:07.050 so if my system is heterogeneous by default 0:25:07.050,0:25:11.960 my coupling is very low, at least at a certain[br]level by default 0:25:11.960,0:25:14.170 because it's the path of least resistance 0:25:14.170,0:25:16.679 is to make the system decoupled 0:25:16.679,0:25:19.300 
it's easier to make things decoupled than[br]coupled 0:25:19.300,0:25:21.510 if they're all running in different languages 0:25:21.510,0:25:25.210 so in the past three months, I'll say 0:25:25.210,0:25:30.490 I have written production code in Objective-C,[br]Ruby, Scala, Clojure, Node 0:25:30.490,0:25:34.059 I don't know, more stuff, Java 0:25:34.059,0:25:35.670 all these different languages 0:25:35.670,0:25:38.809 real code for work 0:25:38.809,0:25:40.550 and yes, they are not tightly coupled 0:25:40.550,0:25:44.650 like I haven't installed JRuby so that I could[br]reach into the internals of my Scala code 0:25:44.650,0:25:45.630 because that would be a pain 0:25:45.630,0:25:50.730 I don't want to do that 0:25:50.730,0:25:52.960 another very important one is 0:25:52.960,0:25:55.559 server nodes are disposable 0:25:55.559,0:25:59.429 so, back when I was at GE, for example 0:25:59.429,0:26:02.730 I remember being really proud when I looked[br]at the uptime of one of my servers 0:26:02.730,0:26:05.480 and it was like four hundred days or something 0:26:05.480,0:26:07.150 it's like, wow, this is awesome 0:26:07.150,0:26:09.750 I have this big server, it had all these apps[br]on it 0:26:09.750,0:26:12.940 we kept it running for four hundred days 0:26:12.940,0:26:14.809 the problem with that is I was afraid to ever[br]touch it 0:26:14.809,0:26:17.510 I was really happy it was alive 0:26:17.510,0:26:18.860 but I didn't want to do anything to it 0:26:18.860,0:26:21.250 I was afraid to update the operating system 0:26:21.250,0:26:23.770 in fact you could not upgrade Solaris then[br]without restarting it 0:26:23.770,0:26:27.540 so that meant I had not been upgrading the operating[br]system 0:26:27.540,0:26:32.390 I probably shouldn't have been too proud about[br]it 0:26:32.390,0:26:34.890 Nodes that are alive for a long time lead[br]to fear 0:26:34.890,0:26:37.440 and what I want is less fear 0:26:37.440,0:26:39.340 so I throw them away 0:26:39.340,0:26:42.900 and this means I
don't have physical servers[br]that I throw away 0:26:42.900,0:26:45.920 that would be fun but I'm not that rich yet 0:26:45.920,0:26:49.160 we use AWS right now, you could do it with[br]any kind of cloud service 0:26:49.160,0:26:52.640 or even internal cloud provider 0:26:52.640,0:26:53.780 but every node is disposable 0:26:53.780,0:27:00.550 so, we never upgrade software on an existing[br]server 0:27:00.550,0:27:03.150 whenever you want to deploy a new version[br]of a service 0:27:03.150,0:27:04.370 you create new servers 0:27:04.370,0:27:05.429 and you deploy that version 0:27:05.429,0:27:08.790 and then you replace them in the load balancer[br]or somewhere 0:27:08.790,0:27:10.200 that's it 0:27:10.200,0:27:13.100 so, you never have to wonder what's on a server 0:27:13.100,0:27:15.620 because it was deployed through an automated[br]process 0:27:15.620,0:27:16.840 and there's no fear there 0:27:16.840,0:27:17.980 you know exactly what it is 0:27:17.980,0:27:19.320 you know exactly how to recreate it 0:27:19.320,0:27:21.540 because you have a golden master image 0:27:21.540,0:27:24.200 and in our case it's actually an Amazon image 0:27:24.200,0:27:26.380 that you can just boot more of 0:27:26.380,0:27:27.440 scaling is a problem 0:27:27.440,0:27:29.070 you just boot ten more servers 0:27:29.070,0:27:32.520 boom, done, no problem 0:27:32.520,0:27:35.450 so yeah I tell the team, you know, pick your[br]technology 0:27:35.450,0:27:38.090 everything must be automated, that's another[br]piece 0:27:38.090,0:27:43.059 if you're going to deploy a Clojure service[br]for the first time 0:27:43.059,0:27:46.760 you have to be responsible for figuring out[br]how it fits into our deployment system 0:27:46.760,0:27:50.309 so that you have immutable deployments and[br]disposable nodes 0:27:50.309,0:27:53.760 if you can do that and you're willing to also[br]maintain it and teach someone else 0:27:53.760,0:27:55.910 about the little piece of code that you wrote,[br]then cool
0:27:55.910,0:27:59.010 you can do it, any level you want 0:27:59.010,0:28:02.929 and then once you deploy stuff 0:28:02.929,0:28:05.250 like a lot of us like to just SSH into the machines 0:28:05.250,0:28:07.679 and then twiddle with things and replace files 0:28:07.679,0:28:11.660 and like try like fixing bugs live on production 0:28:11.660,0:28:13.990 why not just throw away the actual keys 0:28:13.990,0:28:16.590 because you're going to throw away the system[br]eventually 0:28:16.590,0:28:19.140 you don't even need root access to it 0:28:19.140,0:28:21.490 you don't need to be able to get to it 0:28:21.490,0:28:24.980 except through the port that your service[br]is listening on 0:28:24.980,0:28:26.840 so you can't screw it up 0:28:26.840,0:28:29.470 you can't introduce entropy and mess things[br]up 0:28:29.470,0:28:31.470 if you throw away the keys 0:28:31.470,0:28:33.640 so this is actually a practice that you can[br]do 0:28:33.640,0:28:36.460 deploy the servers, remove all the credentials 0:28:36.460,0:28:39.299 for logging in and the only option you have 0:28:39.299,0:28:43.610 is to destroy them when you're done with them 0:28:43.610,0:28:45.140 provisioning new services in our world 0:28:45.140,0:28:46.960 must also be trivial 0:28:46.960,0:28:51.370 so we have actually now thrown away our Chef[br]repository 0:28:51.370,0:28:54.340 because Chef is obsolete and 0:28:54.340,0:28:56.049 we have replaced it with shell scripts 0:28:56.049,0:29:01.340 and that sounds like I'm an idiot 0:29:01.340,0:29:04.460 I know, but when I say Chef is obsolete 0:29:04.460,0:29:05.480 I don't really mean that 0:29:05.480,0:29:07.100 I like to say that so that people will think 0:29:07.100,0:29:08.450 because a lot of you are probably thinking 0:29:08.450,0:29:11.040 we should move to Chef 0:29:11.040,0:29:11.809 that would be great 0:29:11.809,0:29:13.530 because what you have is a bunch of servers 0:29:13.530,0:29:14.670 that are running for a long time 0:29:14.670,0:29:17.110 
and you need to be able to continue to keep[br]them up to date 0:29:17.110,0:29:19.150 Chef is really great at that 0:29:19.150,0:29:22.059 Chef is also good at booting a new server 0:29:22.059,0:29:24.340 but really it's just overkill for that 0:29:24.340,0:29:25.059 yeah 0:29:25.059,0:29:26.460 so if you're always throwing stuff away 0:29:26.460,0:29:27.809 I don't think you need Chef 0:29:27.809,0:29:29.160 do something really, really simple 0:29:29.160,0:29:29.950 and that's what we've done 0:29:29.950,0:29:33.090 so like whenever we deploy a new type of service 0:29:33.090,0:29:37.730 I set up ZooKeeper recently, which is a complete[br]change from the other stuff we're deploying 0:29:37.730,0:29:39.980 I think it was a five line shell script to[br]do that 0:29:39.980,0:29:42.590 I just added it to a git repo and ran a command 0:29:42.590,0:29:47.340 I've got a cluster of ZooKeeper servers running 0:29:47.340,0:29:51.260 you want to always be deploying your software 0:29:51.260,0:29:55.570 this is something I learned from Kent Beck[br]early on in the agile extreme programming 0:29:55.570,0:29:56.330 world 0:29:56.330,0:29:57.980 that if something is hard 0:29:57.980,0:30:00.420 or you perceive it to be hard or difficult 0:30:00.420,0:30:02.290 the best thing you can do 0:30:02.290,0:30:04.390 if you have to do that thing all the time 0:30:04.390,0:30:07.000 is to just do it constantly 0:30:07.000,0:30:09.090 non-stop all the time 0:30:09.090,0:30:10.910 so like deploying in our old world 0:30:10.910,0:30:15.280 where it would take all night once a week 0:30:15.280,0:30:18.040 if we instituted a new policy 0:30:18.040,0:30:19.270 in that team that said 0:30:19.270,0:30:23.100 any change that goes to master must be deployed[br]within five minutes 0:30:23.100,0:30:28.410 I guarantee you we would have fixed that process,[br]right 0:30:28.410,0:30:29.730 and if you're deploying constantly 0:30:29.730,0:30:31.080 all day every day 0:30:31.080,0:30:33.120 you're never
going to be afraid of deployments 0:30:33.120,0:30:36.020 because it's always a small change 0:30:36.020,0:30:37.929 so always be deploying 0:30:37.929,0:30:40.410 every new deploy means you're throwing away[br]old servers 0:30:40.410,0:30:42.600 and replacing them with new ones 0:30:42.600,0:30:45.610 in our world I would say that the average[br]uptime 0:30:45.610,0:30:48.240 of one of our servers is probably something[br]like 0:30:48.240,0:30:55.179 seventeen hours and that's because we don't[br]tend to work on the weekend very much 0:30:55.179,0:30:56.870 you also, when you have these sorts of systems 0:30:56.870,0:30:58.710 that are distributed like this 0:30:58.710,0:31:02.100 and you're trying to reduce the fear of change 0:31:02.100,0:31:04.350 the big thing that you're afraid of is failure 0:31:04.350,0:31:06.110 you're afraid that the service is going to[br]fail 0:31:06.110,0:31:07.110 the system is going to go down 0:31:07.110,0:31:10.070 one component won't be reachable, that sort[br]of thing 0:31:10.070,0:31:12.370 so you just have to assume that that's going[br]to happen 0:31:12.370,0:31:17.210 you are not going to build a system that never[br]fails, ever 0:31:17.210,0:31:19.740 I hope you don't, because you will have wasted[br]much of your life 0:31:19.740,0:31:21.100 trying to get that to happen 0:31:21.100,0:31:24.309 instead, assume that the thing, the components[br]are going to fail 0:31:24.309,0:31:25.960 and build resiliency in 0:31:25.960,0:31:28.030 I have a picture here of Joe Armstrong 0:31:28.030,0:31:30.380 who is one of the inventors of Erlang 0:31:30.380,0:31:34.890 if you have not studied Erlang philosophy[br]around failure and recovery 0:31:34.890,0:31:35.340 you should 0:31:35.340,0:31:36.470 and it won't take you long 0:31:36.470,0:31:39.070 so I'm just going to leave that as homework[br]for you 0:31:39.070,0:31:42.110 and then, you know, I said, the tests are[br]a design pattern 0:31:42.110,0:31:43.540 I don't mean don't write any
tests 0:31:43.540,0:31:45.950 but I also want to be further responsible[br]here 0:31:45.950,0:31:50.540 and say you should monitor everything 0:31:50.540,0:31:52.880 you want to favor measurement over testing 0:31:52.880,0:31:57.130 so I use measurement as a surrogate for testing 0:31:57.130,0:31:57.850 or as an enhancement 0:31:57.850,0:32:03.980 and the reason I say this is 0:32:03.980,0:32:05.650 you can either focus on one of two things 0:32:05.650,0:32:07.790 I said assume failure right, so 0:32:07.790,0:32:12.370 mean time between failures or mean time to[br]resolution 0:32:12.370,0:32:16.200 those are kind of two metrics in the ops world 0:32:16.200,0:32:17.400 that people talk about 0:32:17.400,0:32:20.140 for measuring their success and their effectiveness 0:32:20.140,0:32:21.980 mean time between failures means 0:32:21.980,0:32:25.360 you're trying to increase the time between[br]failures 0:32:25.360,0:32:29.290 of the system, so basically you're trying[br]to make failures never happen, right 0:32:29.290,0:32:31.059 mean time to resolution means 0:32:31.059,0:32:34.679 when they happen, I'm gonna focus on bringing[br]them back 0:32:34.679,0:32:37.290 as fast as I possibly can 0:32:37.290,0:32:41.120 so a perfect example would be a system fails 0:32:41.120,0:32:43.720 and another one is already up and just takes[br]over its work 0:32:43.720,0:32:46.679 mean time to resolution is essentially zero,[br]right 0:32:46.679,0:32:50.679 if you're always assuming that every component[br]can and will fail 0:32:50.679,0:32:53.770 then mean time to resolution is going to be[br]really good 0:32:53.770,0:32:56.240 because you're going to bake it into the process 0:32:56.240,0:32:59.480 if you do that, you don't care about when[br]things fail 0:32:59.480,0:33:02.640 and back to this idea of favoring measurement[br]over testing 0:33:02.640,0:33:07.250 if you're monitoring everything, everything[br]with intelligence 0:33:07.250,0:33:10.390 then you're actually focusing on mean
time[br]to resolution 0:33:10.390,0:33:15.750 and acknowledging that the software is going[br]to be broken sometimes, right 0:33:15.750,0:33:18.200 and when I say monitor everything, I mean[br]everything 0:33:18.200,0:33:21.940 I don't mean, like your disk space and your[br]memory and stuff there 0:33:21.940,0:33:23.669 I'm talking about business metrics 0:33:23.669,0:33:27.630 so, at LivingSocial we created this thing[br]called rearview 0:33:27.630,0:33:29.250 which is now open source 0:33:29.250,0:33:33.030 which allows you to do aberration detection 0:33:33.030,0:33:37.919 and aberration means strange behavior, strange[br]change in behavior 0:33:37.919,0:33:41.679 so rearview can do aberration detection 0:33:41.679,0:33:44.690 on data sets, arbitrary data sets 0:33:44.690,0:33:47.010 which means, like in the LivingSocial world 0:33:47.010,0:33:48.230 we had user sign-ups 0:33:48.230,0:33:49.190 constantly streaming in 0:33:49.190,0:33:51.559 it was a very high volume site 0:33:51.559,0:33:53.799 if user sign-ups were weird 0:33:53.799,0:33:55.940 we would get an alert 0:33:55.940,0:33:57.540 why might they be weird?
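[Editor's sketch: the kind of alert described here, "sign-ups were weird", can be approximated with a very small statistical check. This is not how rearview works internally; it is a minimal z-score test, and the function name `is_aberrant` and the threshold of 3.0 are assumptions made for illustration.]

```python
from statistics import mean, stdev


def is_aberrant(history, latest, threshold=3.0):
    """Return True if `latest` sits more than `threshold` standard
    deviations away from the mean of `history`.

    `history` is a list of recent counts for one business metric,
    for example user sign-ups per minute (hypothetical data).
    """
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Sign-ups per minute over the last five minutes, then a sudden drop.
recent_signups = [100, 102, 98, 101, 99]
if is_aberrant(recent_signups, 20):
    print("alert: user sign-ups look weird")
```

[A check like this fires whether the drop comes from a crashed service or from a design change nobody liked, which is the reason for watching the business metric and not only the server.]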
0:33:57.540,0:34:00.830 one thing could be like the user service is[br]down, right 0:34:00.830,0:34:02.100 so then we would get two alerts 0:34:02.100,0:34:04.010 user sign-ups have gone down 0:34:04.010,0:34:05.150 and so has the service 0:34:05.150,0:34:07.510 so obviously the problem is the service is[br]down 0:34:07.510,0:34:09.679 let's bring it back up 0:34:09.679,0:34:11.409 but it could be something like 0:34:11.409,0:34:13.349 a front-end developer or a designer 0:34:13.349,0:34:16.469 made a change that was intentional 0:34:16.469,0:34:18.040 but it just didn't work and no one liked it 0:34:18.040,0:34:21.168 so they didn't sign up to the site anymore 0:34:21.168,0:34:23.980 that's more important than just knowing that[br]the service is down 0:34:23.980,0:34:25.460 right, because what you care about 0:34:25.460,0:34:27.190 isn't that the service is up or down 0:34:27.190,0:34:30.540 if you could crash the entire system and still[br]be making money 0:34:30.540,0:34:31.859 you don't care, right, that's better 0:34:31.859,0:34:34.839 throw it away and stop paying for the servers 0:34:34.839,0:34:40.679 but if your system is up 100% of the time[br]and performs excellently 0:34:40.679,0:34:43.359 but no one's using it, that's bad 0:34:43.359,0:34:49.279 so monitoring business metrics gives you a[br]lot more than unit tests could ever give you 0:34:49.279,0:34:50.899 and then in our world 0:34:50.899,0:34:51.859 we focused on experiencing 0:34:51.859,0:34:56.259 no, you have to come up to front and say ten!
0:34:56.259,0:34:59.220 ok, ten minutes left 0:34:59.220,0:35:01.989 when I got to 6WunderKinder in Berlin 0:35:01.989,0:35:04.069 everyone was terrified to touch the system 0:35:04.069,0:35:08.710 because they had created a really well-designed 0:35:08.710,0:35:12.009 but traditional monolithic API 0:35:12.009,0:35:13.539 so they had layers of abstractions 0:35:13.539,0:35:15.289 it was all kind of in one big thing 0:35:15.289,0:35:16.519 they had a huge database 0:35:16.519,0:35:19.720 and they were really, really scared to do[br]anything 0:35:19.720,0:35:22.190 so there's like one person who would deploy[br]anything 0:35:22.190,0:35:24.190 and everyone else was trying to work on other[br]projects 0:35:24.190,0:35:25.950 and not touch it 0:35:25.950,0:35:27.859 but it was like the production system 0:35:27.859,0:35:29.960 you know so it wasn't really an option 0:35:29.960,0:35:31.880 so the first thing I did in my first week 0:35:31.880,0:35:34.920 is I got these graphs going 0:35:34.920,0:35:39.239 and this was, yeah, response time 0:35:39.239,0:35:42.749 and the first thing I did is I started turning[br]off servers 0:35:42.749,0:35:44.279 and just watching the graphs 0:35:44.279,0:35:47.749 and then, as I was turning off the servers 0:35:47.749,0:35:49.380 I went to the production database 0:35:49.380,0:35:54.220 and I did select, count, star from tasks 0:35:54.220,0:35:55.650 and we're a task management app 0:35:55.650,0:35:58.249 so we have hundreds of millions of tasks 0:35:58.249,0:36:00.910 and the whole thing crashed 0:36:00.910,0:36:04.119 and all the people were like AAAAH what's[br]going on 0:36:04.119,0:36:05.630 you know, and I said, it's no problem 0:36:05.630,0:36:08.539 I did this on purpose, I'll just make it come[br]back 0:36:08.539,0:36:10.119 which I did 0:36:10.119,0:36:11.079 and from that point on 0:36:11.079,0:36:13.349 like, really every day I would do something 0:36:13.349,0:36:16.999 which would basically crash the system for just[br]a
moment 0:36:16.999,0:36:19.819 and really, like, we had way too many servers[br]in production 0:36:19.819,0:36:22.690 we were spending tens of thousands more Euros[br]per month 0:36:22.690,0:36:25.079 than we should have on the infrastructure 0:36:25.079,0:36:27.499 and I just started taking things away 0:36:27.499,0:36:28.819 and I would usually do it 0:36:28.819,0:36:30.579 instead of the responsible way, 0:36:30.579,0:36:31.630 like one server at a time 0:36:31.630,0:36:34.079 I would just remove all of them and start[br]adding them back 0:36:34.079,0:36:36.220 so for a moment everything was down 0:36:36.220,0:36:38.809 but after that we got to a point where 0:36:38.809,0:36:41.299 everyone on the team was absolutely comfortable 0:36:41.299,0:36:42.720 with the worst case scenario 0:36:42.720,0:36:45.180 of the system being completely down 0:36:45.180,0:36:47.989 so that we could, in a panic free way 0:36:47.989,0:36:51.059 just focus on bringing it up when it was bad 0:36:51.059,0:36:52.940 so now when you do a deployment 0:36:52.940,0:36:54.710 and you have your business metrics being measured 0:36:54.710,0:36:57.160 you know the important stuff is happening 0:36:57.160,0:37:00.559 and you know what to do when everything is[br]down 0:37:00.559,0:37:02.509 you've experienced the worst thing that can[br]happen 0:37:02.509,0:37:04.690 well the worst thing is like someone breaks[br]in 0:37:04.690,0:37:07.789 and steals all your stuff, steals all your[br]users' phone numbers 0:37:07.789,0:37:10.140 and posts them online like Snapchat or something 0:37:10.140,0:37:13.650 but you've experienced all these potentially[br]horrible things 0:37:13.650,0:37:16.920 and realized, eh, it's not so bad, I can deal[br]with this 0:37:16.920,0:37:19.119 I know what to do 0:37:19.119,0:37:22.400 it allows you to start making bold moves 0:37:22.400,0:37:23.640 and that's what we all want right 0:37:23.640,0:37:28.739 we all want to be able to bravely go into[br]our systems
0:37:28.739,0:37:30.319 and do anything we think is right 0:37:30.319,0:37:33.869 so that's what I've been focusing on 0:37:33.869,0:37:36.769 we also do this thing called Canary in the[br]Coal Mine deployments 0:37:36.769,0:37:38.999 which removes the fear, also 0:37:38.999,0:37:43.479 canary in the coal mine refers to a kind of[br]sad thing 0:37:43.479,0:37:46.869 about coal miners in the US 0:37:46.869,0:37:49.400 where they would send canaries into the mines 0:37:49.400,0:37:50.380 at various levels 0:37:50.380,0:37:54.170 and if the canary died they knew there was[br]a problem 0:37:54.170,0:37:58.299 with the air 0:37:58.299,0:37:59.470 but in the software world 0:37:59.470,0:38:02.839 what this means is you have a bunch of servers[br]running 0:38:02.839,0:38:06.400 or a bunch of, I don't know, clients running[br]a certain version 0:38:06.400,0:38:09.789 and you start introducing a new version incrementally 0:38:09.789,0:38:11.769 and watching the effects 0:38:11.769,0:38:13.210 so once you're measuring everything 0:38:13.210,0:38:14.680 and monitoring everything 0:38:14.680,0:38:17.039 you can also start doing these canary in the[br]coal mine things 0:38:17.039,0:38:19.069 where you say OK I have a new version of this[br]service 0:38:19.069,0:38:20.369 that I'm going to deploy 0:38:20.369,0:38:22.769 and I've got thirty servers running for it 0:38:22.769,0:38:25.869 but I'm going to change only five of them[br]now 0:38:25.869,0:38:27.950 and see, like, does my error rate increase 0:38:27.950,0:38:30.180 or does my performance drop on those servers 0:38:30.180,0:38:33.880 or do people actually not successfully complete[br]the task they're trying to do 0:38:33.880,0:38:34.650 on those servers 0:38:34.650,0:38:39.989 so, this also allows us the combination of[br]monitoring everything 0:38:39.989,0:38:41.989 and these immutable deployments and everything 0:38:41.989,0:38:46.569 gives us the ability to gradually effect change[br]and not be afraid 0:38:46.569,0:38:48.460 
so we roll out changes all day every day 0:38:48.460,0:38:53.819 because we don't fear that we're just going[br]to destroy the entire system all at once 0:38:53.819,0:38:55.880 so I think I have like five minutes left 0:38:55.880,0:39:00.239 uh, these are some things we're not necessarily[br]doing yet 0:39:00.239,0:39:01.970 but they're some ideas that I have 0:39:01.970,0:39:04.940 that given some free time I will work on 0:39:04.940,0:39:08.700 and, they're probably more exciting 0:39:08.700,0:39:11.319 one is I talked about homeostatic regulation 0:39:11.319,0:39:13.579 and homeostasis 0:39:13.579,0:39:16.640 so I think we all understand the idea of you[br]know homeostasis 0:39:16.640,0:39:20.019 and the fact that systems have different parts[br]that do different roles 0:39:20.019,0:39:21.819 and can protect each other from each other 0:39:21.819,0:39:27.819 but, so this diagram is actually just some[br]random diagram 0:39:27.819,0:39:30.789 I copied and pasted off the AWS website 0:39:30.789,0:39:33.589 so it's not necessarily all that meaningful 0:39:33.589,0:39:36.279 except to show that every architecture 0:39:36.279,0:39:38.680 especially server based architectures 0:39:38.680,0:39:42.979 has a collection of services that play different[br]roles 0:39:42.979,0:39:45.079 and it almost looks like a person 0:39:45.079,0:39:46.989 you've got a brain and a heart and a liver 0:39:46.989,0:39:50.690 and all these things, right 0:39:50.690,0:39:53.009 what would it mean to actually implement 0:39:53.009,0:39:56.539 homeostatic regulation in a web service? 
0:39:56.539,0:39:59.539 so that you have some controlling system 0:39:59.539,0:40:02.579 where the database will actually kill an app[br]server 0:40:02.579,0:40:04.859 that is hurting it, for example 0:40:04.859,0:40:07.410 just kill it 0:40:07.410,0:40:08.769 I don't know yet, I don't know what that is 0:40:08.769,0:40:14.339 but some ideas about this stuff 0:40:14.339,0:40:15.569 I don't know if you've heard of these 0:40:15.569,0:40:19.690 Netflix, do you have Netflix in India yet? 0:40:19.690,0:40:23.499 probably not, unless you have a VPN, right 0:40:23.499,0:40:27.029 Netflix has a really great cloud based architecture 0:40:27.029,0:40:29.839 they have this thing called Chaos Monkey they've[br]created 0:40:29.839,0:40:33.940 which goes through their system and randomly[br]destroys nodes 0:40:33.940,0:40:35.960 just crashes servers 0:40:35.960,0:40:39.680 and they did this because, when they were,[br]they were early users of AWS 0:40:39.680,0:40:42.239 and when they went out initially with AWS,[br]servers were crashing 0:40:42.239,0:40:43.880 like it was still immature 0:40:43.880,0:40:46.279 so they said OK we still want to use this 0:40:46.279,0:40:49.769 and we'll build in stuff so that we can deal[br]with the crashes 0:40:49.769,0:40:52.210 but we have to know it's gonna work when it[br]crashes 0:40:52.210,0:40:55.410 so let's make crashing be part of production 0:40:55.410,0:40:58.210 so they actually have gotten really sophisticated[br]now 0:40:58.210,0:41:00.499 and they will crash entire regions 0:41:00.499,0:41:01.819 cause they're in multiple data centers 0:41:01.819,0:41:03.569 so they'll say like, what would happen if[br]this 0:41:03.569,0:41:06.479 data center went down, does the site still[br]stay up?
0:41:06.479,0:41:08.369 and they do this in production all the time 0:41:08.369,0:41:09.609 like they're crashing servers right now 0:41:09.609,0:41:11.130 it's really neat 0:41:11.130,0:41:14.079 another one that is inspirational in this[br]way 0:41:14.079,0:41:19.170 is Pinterest, they use AWS as well 0:41:19.170,0:41:22.450 and they have, AWS has this thing called Spot[br]Instances 0:41:22.450,0:41:24.440 and I won't go into too much detail 0:41:24.440,0:41:25.910 because I don't have time 0:41:25.910,0:41:29.869 but Spot Instances allow you to effectively 0:41:29.869,0:41:36.210 bid on servers at a price that you are willing[br]to pay 0:41:36.210,0:41:39.559 so like if a usual server costs $0.20 per[br]minute 0:41:39.559,0:41:42.319 you can say, I'll give $0.15 per minute 0:41:42.319,0:41:45.380 and when excess capacity comes open 0:41:45.380,0:41:47.710 it's almost like a stock market 0:41:47.710,0:41:50.299 if $0.15 is the going price, you'll get a[br]server 0:41:50.299,0:41:52.479 and it starts up and it runs what you want 0:41:52.479,0:41:54.400 but here's the cool thing 0:41:54.400,0:42:00.140 if the stock market goes and the price goes[br]higher than you're willing to pay 0:42:00.140,0:42:03.229 Amazon will just turn off those servers 0:42:03.229,0:42:05.219 they're just dead, you don't have any warning 0:42:05.219,0:42:06.579 they're just dead 0:42:06.579,0:42:11.069 so Pinterest uses this for their production[br]servers 0:42:11.069,0:42:13.749 which means they save a lot of money 0:42:13.749,0:42:17.269 they're paying way under the average Amazon[br]cost for hosting 0:42:17.269,0:42:19.309 but the really cool thing in my opinion 0:42:19.309,0:42:21.170 is not the money they save but the fact that 0:42:21.170,0:42:26.039 like, what would you have to do to build a[br]full system 0:42:26.039,0:42:29.119 where any node can and will die at any moment 0:42:29.119,0:42:31.259 and it's not even under your control 0:42:31.259,0:42:33.529 that's really exciting 
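[Editor's sketch: one small ingredient of a system where any node can die at any moment is a caller that treats a dead node as routine and simply moves on to the next one. The names `NodeDown`, `call_with_failover`, `dead_node` and `live_node` are hypothetical; this illustrates the idea only and is not anything Pinterest, Netflix or AWS actually ship.]

```python
class NodeDown(Exception):
    """Raised by a (hypothetical) node that has been terminated."""


def call_with_failover(nodes, request):
    """Try each node in order and return the first successful response."""
    for node in nodes:
        try:
            return node(request)
        except NodeDown:
            continue  # this node is gone; move on to the next one
    raise RuntimeError("no healthy nodes left")


def dead_node(request):
    # Stands in for a server that was switched off without warning.
    raise NodeDown()


def live_node(request):
    return f"handled {request}"


# The first node has vanished; the caller does not need to know or care.
print(call_with_failover([dead_node, live_node], "GET /tasks"))
```

[When every caller is written this way, losing a node costs one extra attempt and no outage, which is what keeps mean time to resolution close to zero.]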
0:42:33.529,0:42:36.259 so a simple thing you can do for homeostasis[br]though 0:42:36.259,0:42:37.509 is you can just adjust 0:42:37.509,0:42:39.489 so in our world we have multiple nodes 0:42:39.489,0:42:40.969 and all these little services 0:42:40.969,0:42:42.569 we can scale each one independently 0:42:42.569,0:42:44.569 we're measuring everything 0:42:44.569,0:42:46.400 so Amazon has a thing called Auto Scaling 0:42:46.400,0:42:49.469 we don't use it, we do our own scaling 0:42:49.469,0:42:54.119 and we just do it based on volume and performance 0:42:54.119,0:42:57.869 now when you have a bunch of services like[br]this 0:42:57.869,0:43:00.539 like, I don't know, maybe we have fifty different[br]services now 0:43:00.539,0:43:03.229 that each play tiny little roles 0:43:03.229,0:43:07.210 it becomes difficult to figure out, like,[br]where things are 0:43:07.210,0:43:10.619 so we've started implementing ZooKeeper for[br]service resolution 0:43:10.619,0:43:14.130 which means a service can come online and[br]say 0:43:14.130,0:43:17.539 I'm the reminder service version 2.3 0:43:17.539,0:43:19.349 and then tell a central guardian 0:43:19.349,0:43:21.979 and the ZooKeeper can then route traffic to[br]it 0:43:21.979,0:43:24.019 probably too detailed for now 0:43:24.019,0:43:28.420 I'm gonna skip over some stuff real quick 0:43:28.420,0:43:29.499 but I want to talk about this one 0:43:29.499,0:43:33.739 if, did the Nordic Ruby, no, Nordic Ruby talks[br]never go online 0:43:33.739,0:43:35.160 so you can never see this talk 0:43:35.160,0:43:36.630 sorry 0:43:36.630,0:43:41.499 at Nordic Ruby Reginald Braithwaite did a[br]really cool talk 0:43:41.499,0:43:44.130 on like challenges of the Ruby language 0:43:44.130,0:43:45.380 and he made this statement 0:43:45.380,0:43:48.869 Ruby has beautiful but static coupling 0:43:48.869,0:43:51.269 which was really strange 0:43:51.269,0:43:52.989 but basically he was making the same point[br]that 0:43:52.989,0:43:53.950 I was talking
about earlier 0:43:53.950,0:43:59.210 that, like Ruby creates a bunch of ways that[br]you can couple 0:43:59.210,0:44:01.200 your system together 0:44:01.200,0:44:02.729 that kind of screw you in the end 0:44:02.729,0:44:03.960 but they're really beautiful to use 0:44:03.960,0:44:09.819 but, like, Ruby can really lead to some deep[br]crazy coupling 0:44:09.819,0:44:14.089 and so he presented this idea of bind by contract 0:44:14.089,0:44:17.930 and bind by contract, in a Ruby sense 0:44:17.930,0:44:22.539 would be, like, I have a class that has a[br]method 0:44:22.539,0:44:26.410 that takes these parameters under these conditions 0:44:26.410,0:44:29.420 and I can kind of put it into my VM 0:44:29.420,0:44:31.999 and whenever someone needs to have a functionality[br]like that 0:44:31.999,0:44:34.650 it will be automatically bound together 0:44:34.650,0:44:36.589 by the fact that it can do that thing 0:44:36.589,0:44:40.680 and instead of how we tend to use Ruby and[br]Java and other languages 0:44:40.680,0:44:42.910 I have a class with a method name I'm going[br]to call it 0:44:42.910,0:44:45.319 right, that's coupling 0:44:45.319,0:44:48.009 but he proposed this idea of this decoupled[br]system 0:44:48.009,0:44:50.609 where you just say I need a functionality[br]like this 0:44:50.609,0:44:53.390 that works under the conditions that I have[br]present 0:44:53.390,0:44:55.369 so this led me to this idea 0:44:55.369,0:44:59.059 and this may be like way too weird, I don't[br]know 0:44:59.059,0:45:02.569 what if in your web application your routes[br]file 0:45:02.569,0:45:08.130 for your services read like a functional pattern[br]matching syntax 0:45:08.130,0:45:11.200 so like if you've ever used Erlang or Haskell[br]or Scala 0:45:11.200,0:45:14.509 any of these things that have functional pattern[br]matching 0:45:14.509,0:45:18.680 what if you could then route to different[br]services 0:45:18.680,0:45:20.880 across a bunch of different services 0:45:20.880,0:45:23.450 
based on contract 0:45:23.450,0:45:27.279 now I have zero time left 0:45:27.279,0:45:29.029 but I'm just gonna keep talking, cause I'm[br]mean 0:45:29.029,0:45:30.349 oh wait I'm not allowed to be mean 0:45:30.349,0:45:31.579 because of the code of conduct 0:45:31.579,0:45:34.759 so I'll wrap up 0:45:34.759,0:45:38.749 so this is an idea that I've started working[br]on as well 0:45:38.749,0:45:40.539 where I would actually write an Erlang service 0:45:40.539,0:45:42.700 with this sort of functional pattern matching 0:45:42.700,0:45:45.589 but have it be routing in really fast real[br]time 0:45:45.589,0:45:48.539 through back end services that support it 0:45:48.539,0:45:50.650 one more thing I just want to show you real[br]quick 0:45:50.650,0:45:53.869 that I am working on and I want to show you 0:45:53.869,0:45:57.910 because I want you to help me 0:45:57.910,0:46:00.690 has anyone used JSON schema? 0:46:00.690,0:46:05.890 OK, you people are my friends for the rest[br]of the conference 0:46:05.890,0:46:08.469 in a system where you have all these things[br]talking to each other 0:46:08.469,0:46:11.219 you do need a way to validate the inputs and[br]outputs 0:46:11.219,0:46:16.229 but I don't want to generate code that parses[br]and creates JSON 0:46:16.229,0:46:21.180 I don't want to do something in real time[br]that intercepts my 0:46:21.180,0:46:24.219 kind of traffic, so there's this thing called[br]JSON schema 0:46:24.219,0:46:27.219 that allows you to, in a completely decoupled[br]way 0:46:27.219,0:46:30.719 specify JSON documents and how they should[br]interact 0:46:30.719,0:46:35.849 and I am working on a new thing that's called[br]Klagen 0:46:35.849,0:46:38.299 which is the German word for complain 0:46:38.299,0:46:42.420 it's written in Scala, so if anyone wants[br]to pair up on some Scala stuff 0:46:42.420,0:46:47.700 what it will be is a high performance asynchronous[br]JSON schema validation middleware 0:46:47.700,0:46:52.749 so if that's interesting to
anyone, even if[br]you don't know Scala or JSON schema 0:46:52.749,0:46:54.029 please let me know 0:46:54.029,0:46:57.099 and I believe I'm out of time so I'm just[br]gonna end there 0:46:57.099,0:46:58.609 am I right? I'm right, yes 0:46:58.609,0:47:01.529 so thank you very much, and let's talk during[br]the conference