0:00:24.900,0:00:25.410


0:00:25.410,0:00:26.650
Chad: Yes, hello, thank you.

0:00:26.650,0:00:29.140
Audience member: Hello!

0:00:29.140,0:00:30.810
Chad: Hello!

0:00:30.810,0:00:33.629
I am Chad, as he said.

0:00:33.629,0:00:35.300
He said I need no introduction

0:00:35.300,0:00:37.890
so I won't introduce myself any further.

0:00:37.890,0:00:44.890
I may be the biggest non-Indian fan of India

0:01:01.690,0:01:06.400
[Hindi speech]

0:01:06.400,0:01:13.400


0:01:15.390,0:01:17.830
I'll now switch back, sorry.

0:01:17.830,0:01:20.010
If you don't understand Hindi, I said nothing[br]of value

0:01:20.010,0:01:22.030
and it was all wrong.

0:01:22.030,0:01:23.549
But I was saying that my Hindi is bad

0:01:23.549,0:01:26.000
and it's because now I'm learning German

0:01:26.000,0:01:28.110
so I mixed them together, but I know not everyone

0:01:28.110,0:01:29.440
speaks Hindi here.

0:01:29.440,0:01:32.480
I just had to show off, you know

0:01:32.480,0:01:37.110
So, I am currently working at 6WunderKinder,

0:01:37.110,0:01:40.370
and I'm working on a product called Wunderlist.

0:01:40.370,0:01:42.060
It is a productivity application.

0:01:42.060,0:01:45.860
It runs on every client you can think of.

0:01:45.860,0:01:47.620
We have native clients, we have a back-end,

0:01:47.620,0:01:49.690
we have millions of active users,

0:01:49.690,0:01:51.850
and I'm telling you this not so that you'll[br]go download it -

0:01:51.850,0:01:53.390
you can do that too -

0:01:53.390,0:01:56.960
but I want to tell you about the challenges[br]that I have

0:01:56.960,0:02:00.710
and the way I'm starting to think about systems[br]architecture and design.

0:02:00.710,0:02:03.130
That's what I'm gonna talk about today

0:02:03.130,0:02:05.690
I'm going to show you some things that are[br]real

0:02:05.690,0:02:07.159
and that we're really doing.

0:02:07.159,0:02:09.429
I'm going to show you some things that are

0:02:09.429,0:02:12.670
just a fantasy that maybe don't make any sense[br]at all.

0:02:12.670,0:02:13.980
But hopefully I'll get you to think about

0:02:13.980,0:02:15.540
how we think about system architecture

0:02:15.540,0:02:18.370
and how we build things that can last for[br]a long time.

0:02:18.370,0:02:20.870
So the first thing that I want to mention:

0:02:20.870,0:02:23.340
this is a graph from the Standish CHAOS Report

0:02:23.340,0:02:25.430
and I've taken the years out

0:02:25.430,0:02:27.310
and I've taken some of the raw data out

0:02:27.310,0:02:28.720
because it doesn't matter.

0:02:28.720,0:02:30.530
If you look at these, this graph,

0:02:30.530,0:02:33.379
each one of these bars is a year,

0:02:33.379,0:02:38.159
and each bar represents successful projects[br]in green -

0:02:38.159,0:02:40.079
software projects.

0:02:40.079,0:02:42.409
Challenged projects are in silver or white[br]in the middle

0:02:42.409,0:02:44.249
and then failed ones are in red.

0:02:44.249,0:02:47.340
But challenged means significantly over time[br]or budget

0:02:47.340,0:02:49.349
which to me means failed too.

0:02:49.349,0:02:51.430
So basically we're terrible,

0:02:51.430,0:02:54.279
all of us here, we're terrible.

0:02:54.279,0:02:57.060
We call ourselves engineers but it's a disgrace.

0:02:57.060,0:03:00.840
We very rarely actually launch things that[br]work.

0:03:00.840,0:03:01.389
Kind of sad,

0:03:01.389,0:03:03.829
and I am here to bring you down.

0:03:03.829,0:03:07.169
Then once you launch software, anecdotally,

0:03:07.169,0:03:12.359
and you probably would see this in your own[br]work lives, too,

0:03:12.359,0:03:16.230
anecdotally, software gets killed after about[br]five years -

0:03:16.230,0:03:17.650
business software.

0:03:17.650,0:03:19.950
So you barely ever get to launch it, because,

0:03:19.950,0:03:23.319
or at least successfully, in a way that you're[br]proud of,

0:03:23.319,0:03:24.650
and then in about five years

0:03:24.650,0:03:27.739
you end up in that situation where you're[br]doing a big rewrite

0:03:27.739,0:03:29.519
and throwing everything away and replacing[br]it.

0:03:29.519,0:03:32.519
You know there's always that project to get[br]rid of the junk,

0:03:32.519,0:03:35.569
old Java code or whatever that you wrote five[br]years ago,

0:03:35.569,0:03:37.180
replace it with Ruby now,

0:03:37.180,0:03:39.909
five years from now you'll be replacing your[br]old junk Ruby code

0:03:39.909,0:03:46.180
that didn't work with something else.

0:03:46.180,0:03:49.379
We create this thing, probably all of you[br]know the term legacy software -

0:03:49.379,0:03:53.340
Right, am I right? You know what legacy software[br]is,

0:03:53.340,0:03:56.139
and you probably think of it as a negative[br]thing.

0:03:56.139,0:03:58.120
You think of it as that ugly code that doesn't[br]work,

0:03:58.120,0:04:02.540
that's brittle, that you can't change, that[br]you're all afraid of.

0:04:02.540,0:04:07.150
But there's actually also a positive connotation[br]of the word legacy:

0:04:07.150,0:04:14.139
it's leaving behind something that future[br]generations can benefit from.

0:04:14.139,0:04:17.370
But if we're rarely ever launching successful[br]projects

0:04:17.370,0:04:20.889
and then the ones we do launch tend to die[br]within five years

0:04:20.889,0:04:24.600
none of us are actually creating a legacy[br]in our work.

0:04:24.600,0:04:27.430
We're just creating stuff that gets thrown[br]away.

0:04:27.430,0:04:29.400
Kind of sad.

0:04:29.400,0:04:32.240
So we create this stuff that's a legacy software.

0:04:32.240,0:04:35.060
It's hard to change, that's why it ends up[br]getting thrown away

0:04:35.060,0:04:37.370
right, that's, if the software worked

0:04:37.370,0:04:40.030
and you could keep changing it to meet the[br]needs of the business

0:04:40.030,0:04:43.979
you wouldn't need to do a big rewrite and[br]throw it away.

0:04:43.979,0:04:47.840
We create these huge tightly-coupled systems,

0:04:47.840,0:04:49.370
and I don't just mean one application,

0:04:49.370,0:04:51.430
but like many applications are all tightly[br]coupled.

0:04:51.430,0:04:55.900
You've got this thing over here talking to[br]the database of this system over here

0:04:55.900,0:04:59.360
so if you change the columns to update the[br]view of a webpage

0:04:59.360,0:05:02.710
you ruin your billing system, that kind of[br]thing

0:05:02.710,0:05:06.270
this is what makes it so hard to change
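
A minimal sketch of the coupling described above, assuming a hypothetical `customers` table that both a web app and a billing job read directly. Every name here (`customers`, `plan`, `subscription_tier`, `billing_plan_for`) is invented for illustration; the sketch uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3


def billing_plan_for(conn, customer_id):
    # Hypothetical billing code that reaches straight into the web app's
    # table, so it silently depends on the column being called "plan".
    row = conn.execute(
        "SELECT plan FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0]


def simulate_shared_database():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)"
    )
    conn.execute("INSERT INTO customers (name, plan) VALUES ('Asha', 'pro')")
    plan_before = billing_plan_for(conn, 1)

    # The web team renames the column to suit a page redesign
    # (done by rebuilding the table so it works on any SQLite version).
    conn.execute("ALTER TABLE customers RENAME TO customers_old")
    conn.execute(
        "CREATE TABLE customers "
        "(id INTEGER PRIMARY KEY, name TEXT, subscription_tier TEXT)"
    )
    conn.execute("INSERT INTO customers SELECT id, name, plan FROM customers_old")
    conn.execute("DROP TABLE customers_old")

    try:
        billing_plan_for(conn, 1)
        billing_broke = False
    except sqlite3.OperationalError:
        # The billing query no longer matches the schema.
        billing_broke = True
    conn.close()
    return plan_before, billing_broke
```

Nothing in the web app's own code or tests would flag the breakage, because the dependency lives only in the billing query.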

0:05:06.270,0:05:09.970
and the sad thing about this is the way we[br]work

0:05:09.970,0:05:13.500
the way we develop software, this is the default[br]setting

0:05:13.500,0:05:18.460
and, what I mean is, if we were robots churning[br]out software

0:05:18.460,0:05:20.819
and we had a preferences panel

0:05:20.819,0:05:25.080
the default preferences would lead to us creating[br]terrible software that gets thrown away in

0:05:25.080,0:05:25.699
five years

0:05:25.699,0:05:27.210
that's just how we all work

0:05:27.210,0:05:30.180
as human beings when we sit down to write[br]code

0:05:30.180,0:05:35.430
our default instincts lead us to create[br]systems that are tightly coupled

0:05:35.430,0:05:41.659
and hard to change and ultimately get thrown[br]away and can't scale

0:05:41.659,0:05:46.060
we create, we try doing tests, we try doing[br]TDD

0:05:46.060,0:05:51.330
but we create test suites that take forty-five[br]minutes to run

0:05:51.330,0:05:52.720
every team has had to deal with this I'm sure

0:05:52.720,0:05:55.990
if you've written any kind of meaningful application

0:05:55.990,0:05:57.970
and it gets to where you have like a project

0:05:57.970,0:05:59.849
to speed up the test suite

0:05:59.849,0:06:02.949
like you start focusing your company's resources

0:06:02.949,0:06:04.949
on making the test suite faster

0:06:04.949,0:06:08.689
or making it like only fail ninety percent[br]of the time

0:06:08.689,0:06:10.949
and then you say well if it only fails ninety[br]percent that's OK

0:06:10.949,0:06:14.550
right, and right now it's taking forty-five[br]minutes

0:06:14.550,0:06:18.180
we want to get it to where it only takes ten[br]minutes to run

0:06:18.180,0:06:24.479
so the test suite ends up being a liability[br]instead of a benefit

0:06:24.479,0:06:25.719
because of the way you do it

0:06:25.719,0:06:29.319
because you have this architecture where everything[br]is so coupled

0:06:29.319,0:06:34.939
you can't change anything without spending[br]hours working on the stupid test suite

0:06:34.939,0:06:38.419
and you're terrified to deploy

0:06:38.419,0:06:42.960
I know like the last big Java project I was[br]working on

0:06:42.960,0:06:45.819
it would take, once a week we did a deploy

0:06:45.819,0:06:50.139
it would take fifteen people all night to[br]deploy the thing

0:06:50.139,0:06:52.460
and usually it was like copying class files[br]around

0:06:52.460,0:06:54.430
and restarting servers

0:06:54.430,0:06:57.120
it's much better today but it's still terrifying

0:06:57.120,0:06:59.340
you deploy code, you change it in production

0:06:59.340,0:07:01.259
you're not sure what might break

0:07:01.259,0:07:03.719
cause it's really hard to test these big integrated[br]things together

0:07:03.719,0:07:08.650
and actually upgrading the technology component[br]is terrifying

0:07:08.650,0:07:13.289
so, how many of you have been doing Rails[br]for more than three years?

0:07:13.289,0:07:18.400
do you have, like a Rails 2 app in production,[br]anyone? Yeah?

0:07:18.400,0:07:21.710
that's a lot of people, wow, that's terrifying

0:07:21.710,0:07:26.129
and I've been in situations, recently, where[br]we had Rails 2 apps in production

0:07:26.129,0:07:29.560
security patches are coming out, we were applying[br]our own versions

0:07:29.560,0:07:30.879
of those security patches

0:07:30.879,0:07:32.439
because we were afraid to upgrade Rails

0:07:32.439,0:07:35.060
we would rather hack it than upgrade the thing

0:07:35.060,0:07:38.319
because you just don't know what's gonna happen

0:07:38.319,0:07:42.490
and then you end up, as you're re-implementing[br]all this stuff yourself

0:07:42.490,0:07:44.819
you end up burning yourself out, wasting your[br]time

0:07:44.819,0:07:47.990
because you're hacking on stupid Rails 2

0:07:47.990,0:07:50.139
or some old Struts version

0:07:50.139,0:07:52.569
when you should be just taking advantage of[br]the new patches

0:07:52.569,0:07:54.639
but you can't because you're afraid to upgrade[br]the software

0:07:54.639,0:07:56.240
because you don't know what's going to happen

0:07:56.240,0:08:02.849
because the system is too big and too scary

0:08:02.849,0:08:04.949
then, and this is really bad, I think this[br]is something

0:08:04.949,0:08:07.009
Ruby messes up for all of us

0:08:07.009,0:08:11.110
I say this as someone who's been using Ruby[br]for thirteen years now

0:08:11.110,0:08:12.740
happily

0:08:12.740,0:08:15.520
we create these mountains of abstractions

0:08:15.520,0:08:17.669
and the logic ends up being buried inside[br]them

0:08:17.669,0:08:22.860
I mean in Java it was like static, or, you[br]know, factories

0:08:22.860,0:08:25.090
and design pattern soup

0:08:25.090,0:08:27.490
in Ruby it's modules and mixins and you know

0:08:27.490,0:08:31.050
we have all these crazy ways of hiding what's[br]actually happening from us

0:08:31.050,0:08:33.070
but when you go look at the code

0:08:33.070,0:08:34.360
it's completely opaque

0:08:34.360,0:08:37.090
you have no idea where the stuff actually[br]gets done

0:08:37.090,0:08:40.820
because it's in some magic library somewhere

0:08:40.820,0:08:45.050
and we do all that because we're trying to[br]save ourselves from the complexity of these

0:08:45.050,0:08:47.450
big nasty systems

0:08:47.450,0:08:50.760
but like if you look at the rest of the world

0:08:50.760,0:08:53.810
this is a software specific problem

0:08:53.810,0:08:58.760
these cars are old, they're older than any[br]software that you would ever run

0:08:58.760,0:09:00.340
and they're still driving down the street

0:09:00.340,0:09:03.460
they're older than software itself, right

0:09:03.460,0:09:06.370
but these things still function, they still[br]work

0:09:06.370,0:09:08.970
how? why? why do they work?

0:09:08.970,0:09:11.340
bodies! my body should not work

0:09:11.340,0:09:12.540
I have abused it

0:09:12.540,0:09:13.870
I should not be standing here today

0:09:13.870,0:09:16.660
I shouldn't have been able to come from Berlin[br]here

0:09:16.660,0:09:18.620
without dying somehow by being in the air

0:09:18.620,0:09:23.660
you know, by the air pressure changes

0:09:23.660,0:09:25.950
but our bodies somehow can survive even when

0:09:25.950,0:09:30.730
we don't take care of them

0:09:30.730,0:09:35.290
and like it's just the system that works,[br]right

0:09:35.290,0:09:37.770
so how do our bodies work?

0:09:37.770,0:09:39.440
how do we stay alive

0:09:39.440,0:09:40.930
despite this fact

0:09:40.930,0:09:42.130
even though we haven't done like some

0:09:42.130,0:09:45.270
great design, we don't have any design patterns

0:09:45.270,0:09:49.780
like mixed up into our bodies

0:09:49.780,0:09:53.980
in biology there is a term called homeostasis

0:09:53.980,0:09:56.210
and I literally don't know what this means

0:09:56.210,0:09:57.390
other than this definition

0:09:57.390,0:09:58.870
so you won't learn about this from me

0:09:58.870,0:10:01.060
there's probably at least one biologist in[br]the room

0:10:01.060,0:10:04.370
so you can correct me later

0:10:04.370,0:10:07.870
but basically the idea of homeostasis is

0:10:07.870,0:10:11.430
that an organism has all these different components

0:10:11.430,0:10:13.890
that serve different purposes

0:10:13.890,0:10:15.820
that regulate it

0:10:15.820,0:10:18.260
so they're all kind of in balance

0:10:18.260,0:10:20.750
and they work together to regulate the system

0:10:20.750,0:10:23.700
if one component, like a liver, does too much

0:10:23.700,0:10:24.720
or does the wrong thing

0:10:24.720,0:10:27.840
another component kicks in and fixes it

0:10:27.840,0:10:30.160
and so our bodies are this well designed system

0:10:30.160,0:10:31.960
for staying alive

0:10:31.960,0:10:34.530
because we have almost like autonomous agents

0:10:34.530,0:10:38.810
internally that take care of the many things[br]that can and do go wrong

0:10:38.810,0:10:41.890
on a regular basis

0:10:41.890,0:10:43.770
so you have, you know, your brain, your liver

0:10:43.770,0:10:47.230
your liver, of course, metabolizes toxic substances

0:10:47.230,0:10:50.400
your kidney deals with blood, water level,[br]et cetera

0:10:50.400,0:10:55.660
you know all these things work in concert[br]to make you live

0:10:55.660,0:11:01.140
the inability to continue to do that is known[br]as homeostatic imbalance

0:11:01.140,0:11:04.070
so I was saying, homeostasis is balancing

0:11:04.070,0:11:07.330
not being able to do that is when you're out[br]of balance

0:11:07.330,0:11:10.340
and that will actually lead to really bad[br]health problems

0:11:10.340,0:11:16.410
or probably death, if you fall into homeostatic[br]imbalance

0:11:16.410,0:11:20.420
so the good news is you're already dying

0:11:20.420,0:11:22.450
like we're all dying all the time

0:11:22.450,0:11:26.500
this is the beautiful thing about death

0:11:26.500,0:11:29.110
there is, there is an estimate that fifty[br]trillion cells

0:11:29.110,0:11:31.850
are in your body, and three million die per[br]second

0:11:31.850,0:11:35.520
it's an estimate because it's actually impossible[br]to count

0:11:35.520,0:11:39.520
but scientists have figured out somehow that[br]this is probably the right number

0:11:39.520,0:11:42.310
so your cells, you've probably heard this[br]all your life

0:11:42.310,0:11:45.170
like physically, after some amount of time,

0:11:45.170,0:11:47.430
you aren't the same human being that you were,[br]physically

0:11:47.430,0:11:52.770
you know, I don't know, you some period of[br]time ago

0:11:52.770,0:11:55.500
you're literally not the same organism anymore

0:11:55.500,0:11:58.420
but you're the same system

0:11:58.420,0:12:01.470
kind of interesting, isn't it

0:12:01.470,0:12:06.740
so in a way you can think about software this way

0:12:06.740,0:12:08.300
you can think about software as a system

0:12:08.300,0:12:10.820
if the components could be replaced like these[br]cells

0:12:10.820,0:12:17.820
like, if you focus on making death, constant[br]death OK

0:12:18.970,0:12:19.890
on a small level

0:12:19.890,0:12:24.690
then the system can live on a large level
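
One way to picture this idea in code is a rough sketch in which a system keeps a stable set of named functions while the implementations behind them are thrown away and replaced. The `System` class, `greet_v1`, and `greet_v2` are hypothetical names made up for this illustration, not anything from Wunderlist.

```python
class System:
    """Keeps a stable set of named functions while implementations come and go."""

    def __init__(self):
        self._components = {}
        self.replacements = 0

    def install(self, name, component):
        # Installing over an existing name is the "cell death": the old
        # implementation is discarded and the new one takes its place.
        if name in self._components:
            self.replacements += 1
        self._components[name] = component

    def call(self, name, *args):
        return self._components[name](*args)


def greet_v1(user):
    return "Hello, " + user


def greet_v2(user):
    # A complete rewrite; only the behavior callers rely on is preserved.
    return f"Hello, {user}"


def demo_replacement():
    system = System()
    system.install("greet", greet_v1)
    before = system.call("greet", "Chad")
    system.install("greet", greet_v2)  # the old "cell" dies here
    after = system.call("greet", "Chad")
    return before, after, system.replacements
```

Callers only ever see the name `greet`, so the code behind it can die and be replaced without the system as a whole changing.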

0:12:24.690,0:12:25.760
that's what this talk is about

0:12:25.760,0:12:29.300
solution, the solution being to mimic living[br]organisms

0:12:29.300,0:12:36.110
and as an aside, I will say many times the[br]word small or tiny in this talk

0:12:36.110,0:12:38.480
because I think I'm learning, as I age

0:12:38.480,0:12:39.870
that small is good

0:12:39.870,0:12:42.950
it's, small projects are good

0:12:42.950,0:12:44.050
you know how to estimate them

0:12:44.050,0:12:45.110
small commitments are good

0:12:45.110,0:12:46.790
because you know you can make them

0:12:46.790,0:12:47.750
small methods are good

0:12:47.750,0:12:48.790
small classes are good

0:12:48.790,0:12:50.140
small applications are good

0:12:50.140,0:12:52.410
small teams are good

0:12:52.410,0:12:55.270
so I don't know, this is sort of a non sequitur

0:12:55.270,0:12:58.130
so if we're going to think about software

0:12:58.130,0:12:59.750
as like an organism

0:12:59.750,0:13:03.100
what is a cell in that context?

0:13:03.100,0:13:06.360
this is sort of the key question that you[br]have to ask yourself

0:13:06.360,0:13:08.940
and I say that a cell is a tiny component

0:13:08.940,0:13:12.800
now, tiny and component are both subjective[br]words

0:13:12.800,0:13:15.370
so you can kind of do what you want with that

0:13:15.370,0:13:17.670
but it's a good frame of thinking

0:13:17.670,0:13:20.530
if you make your software system of tiny components

0:13:20.530,0:13:22.510
each one can be like a cell

0:13:22.510,0:13:28.010
each one can die and the system is a collection[br]of those tiny components

0:13:28.010,0:13:31.930
and what you want is not for your code to[br]live forever

0:13:31.930,0:13:35.700
you don't care that each line of code lives[br]forever, right

0:13:35.700,0:13:38.830
like if you're trying to develop a legacy[br]in software

0:13:38.830,0:13:42.920
it's not important to you that your system[br]dot out dot printline statement

0:13:42.920,0:13:44.300
lives for ten years

0:13:44.300,0:13:48.050
it's important to you that the function of[br]the system lives for ten years

0:13:48.050,0:13:50.170
so like, about exactly ten years ago

0:13:50.170,0:13:57.170
we created RubyGems at RubyConf 2003[br]in Austin, Texas

0:13:59.260,0:14:03.600
I haven't touched RubyGems myself in like[br]four or five years

0:14:03.600,0:14:04.890
but people are still using it

0:14:04.890,0:14:06.130
they hate it because it's software

0:14:06.130,0:14:07.750
everybody hates software right

0:14:07.750,0:14:10.160
so if you can create software that people[br]hate

0:14:10.160,0:14:13.080
you've succeeded

0:14:13.080,0:14:14.450
but it still exists

0:14:14.450,0:14:16.560
I have no idea if any of the code is the same

0:14:16.560,0:14:17.210
I would assume not

0:14:17.210,0:14:21.350
you know I think, I'm sure that my name is[br]still in it in a copyright notice

0:14:21.350,0:14:23.510
but that's about it

0:14:23.510,0:14:24.890
and that's a beautiful thing

0:14:24.890,0:14:28.380
people are still using it to install Ruby[br]libraries

0:14:28.380,0:14:29.570
and software

0:14:29.570,0:14:35.600
and I don't care if any of my existing, or[br]my initial code is still in the system

0:14:35.600,0:14:36.840
because the system still lives

0:14:36.840,0:14:43.030
so, quite a long time ago now I was researching[br]this kind of question

0:14:43.030,0:14:44.600
about legacy software

0:14:44.600,0:14:48.390
and I asked a question on Twitter as I often[br]do at conferences

0:14:48.390,0:14:49.910
when I'm preparing

0:14:49.910,0:14:55.610
what are some of the old surviving software[br]systems you regularly use

0:14:55.610,0:14:58.430
and if you look at this, I mean, one thing[br]is obviously

0:14:58.430,0:15:03.290
everyone who answered gave some sort of Unix[br]related answer

0:15:03.290,0:15:06.510
but basically all of these things on this[br]list

0:15:06.510,0:15:13.240
are either systems that are collections of[br]really well-known split-up components

0:15:13.240,0:15:15.700
or they're tiny, tiny programs

0:15:15.700,0:15:18.540
so, like, grep is a tiny program, make

0:15:18.540,0:15:19.640
it only does one thing

0:15:19.640,0:15:23.970
well make is actually also arguably an operating[br]system

0:15:23.970,0:15:27.320
but I won't get into that

0:15:27.320,0:15:29.390
emacs is obviously an operating system, right

0:15:29.390,0:15:33.050
but it's well designed of these tiny little[br]pieces

0:15:33.050,0:15:37.190
so a lot of the old systems I know about follow[br]this pattern

0:15:37.190,0:15:40.190
this metaphor that I'm proposing

0:15:40.190,0:15:42.170
and from my own career

0:15:42.170,0:15:43.530
when I was here before in Bangalore

0:15:43.530,0:15:47.250
I worked for GE and some of the people

0:15:47.250,0:15:48.690
we hired even worked on the system there

0:15:48.690,0:15:50.970
we had a system called the Bull

0:15:50.970,0:15:53.700
and it was a Honeywell Bull mainframe

0:15:53.700,0:15:57.280
I doubt any of you have worked on that

0:15:57.280,0:15:58.440
but this one I know you didn't work on

0:15:58.440,0:16:01.070
because it had a custom operating system

0:16:01.070,0:16:03.110
with our own RDBMS

0:16:03.110,0:16:06.260
we had created a TCP stack for it

0:16:06.260,0:16:11.160
using like custom hardware that we plugged[br]into a Windows NT computer

0:16:11.160,0:16:14.930
with some sort of MT queuing system back in[br]the day

0:16:14.930,0:16:17.060
it was this terrifying thing

0:16:17.060,0:16:22.510
when I started working there the system was[br]already something like twenty-five years old

0:16:22.510,0:16:25.630
and I believe even though there have been[br]many, many projects

0:16:25.630,0:16:30.160
to try to kill it, like we had a team called[br]the Bull exit team

0:16:30.160,0:16:33.230
I believe the system is still in production

0:16:33.230,0:16:37.070
not as much as it used to be, there are less[br]and less functions in production

0:16:37.070,0:16:39.190
but I believe the system is still in production

0:16:39.190,0:16:46.190
the reason for this is that the system was[br]actually made up of these tiny little components

0:16:47.070,0:16:50.540
and like really clear interfaces between them

0:16:50.540,0:16:53.950
and we kept the system live because every[br]time we tried to replace it

0:16:53.950,0:16:57.290
with some fancy new gem, web thing or GUI[br]app

0:16:57.290,0:16:59.470
it wasn't as good, and the users hated it

0:16:59.470,0:17:00.740
it just didn't work

0:17:00.740,0:17:04.789
so we had to use this old, crazy, modified[br]mainframe

0:17:04.789,0:17:08.150
for a long time as a result

0:17:08.150,0:17:10.890
so, the question I ask myself is now

0:17:10.890,0:17:13.429
how do I, how do I approach a problem like[br]this

0:17:13.429,0:17:19.000
and build a system that can survive for a[br]long time

0:17:19.000,0:17:20.049
I would encourage you

0:17:20.049,0:17:22.589
how many of you know of Fred George

0:17:22.589,0:17:24.720
this is Fred George

0:17:24.720,0:17:25.900
he was at ThoughtWorks for a while

0:17:25.900,0:17:27.669
so he may have, I think he lived in Bangalore

0:17:27.669,0:17:31.150
for some time with ThoughtWorks, in fact

0:17:31.150,0:17:35.050
he is now running a start-up in Silicon Valley

0:17:35.050,0:17:38.600
but he has this talk that you can watch online

0:17:38.600,0:17:41.660
from the Barcelona Ruby Conference the year[br]before last

0:17:41.660,0:17:45.030
called Microservice Architectures

0:17:45.030,0:17:47.760
and he talks in great detail about he,

0:17:47.760,0:17:50.340
how he implemented a concept at Forward

0:17:50.340,0:17:52.150
that's very much like what I'm talking about

0:17:52.150,0:17:55.130
tiny components that only do one thing and[br]can be thrown away

0:17:55.130,0:17:59.890
so Microservice Architecture is kind of the[br]core of what I'm gonna talk about

0:17:59.890,0:18:02.080
now I've put together some rules for 6WunderKinder

0:18:02.080,0:18:04.110
which I am going to share with you

0:18:04.110,0:18:07.110
6WunderKinder is the company I work for

0:18:07.110,0:18:09.220
when we're working on Wunderlist

0:18:09.220,0:18:11.940
and the rules of the, the goals of these rules

0:18:11.940,0:18:16.690
are to reduce coupling, to make it where we[br]can do fear-free deployments

0:18:16.690,0:18:19.190
we reduce the chance of "cruft" in our code

0:18:19.190,0:18:20.680
like nasty stuff that you're afraid of

0:18:20.680,0:18:24.670
that you leave there, kind of broken window[br]problems

0:18:24.670,0:18:28.660
we make it literally trivial to change code

0:18:28.660,0:18:32.680
so you just never have to ask how do I do[br]that

0:18:32.680,0:18:33.990
you just find it easy

0:18:33.990,0:18:39.140
and most importantly we give ourselves the[br]freedom to go fast

0:18:39.140,0:18:43.680
because I think no developer ever wants to[br]be slow

0:18:43.680,0:18:44.670
that's one of the worst things

0:18:44.670,0:18:47.750
just toiling away and not actually accomplishing[br]anything

0:18:47.750,0:18:50.920
but we go slow because we're constrained by[br]the system

0:18:50.920,0:18:54.110
and we're constrained by, sometimes projects

0:18:54.110,0:18:56.010
and other, you know, management related things

0:18:56.010,0:19:01.230
but oftentimes it's the mess of the system[br]that we've created

0:19:01.230,0:19:03.730
so some of the rules

0:19:03.730,0:19:09.480
I think one thing, and maybe, maybe I'm going[br]to get some push back from this crowd

0:19:09.480,0:19:13.170
one rule that is less controversial than it[br]used to be

0:19:13.170,0:19:15.050
is that comments are a design smell

0:19:15.050,0:19:19.270
does anyone strongly disagree with that?

0:19:19.270,0:19:20.750
no?

0:19:20.750,0:19:23.930
does anyone strongly agree with that?

0:19:23.930,0:19:27.210
OK, so the rest of you have no idea what I'm[br]talking about

0:19:27.210,0:19:33.240
so a design smell, I want to define this really[br]quickly

0:19:33.240,0:19:36.860
a design smell is something you see in your[br]code or your system

0:19:36.860,0:19:39.600
where it doesn't necessarily mean it's bad

0:19:39.600,0:19:40.730
but you look at it and you think

0:19:40.730,0:19:43.270
hmm, I should look into this a little bit

0:19:43.270,0:19:45.930
and ask myself, why are there so many comments[br]in this code?

0:19:45.930,0:19:48.300
you know, especially the bottom one

0:19:48.300,0:19:50.650
inline comments?

0:19:50.650,0:19:56.810
definitely bad, definitely a sign that you[br]should have another method, right
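
As an illustration of that point, here is a small sketch in which the inline comment turns into the name of a new method. The shipping rule, the amounts, and all function names are hypothetical, chosen only to show the refactoring.

```python
def order_total_with_comment(item_cents):
    total = sum(item_cents)
    # orders under 5000 cents pay a flat 500 cent shipping fee
    if total < 5000:
        total += 500
    return total


# The same logic, with the comment's content moved into names.
FREE_SHIPPING_THRESHOLD_CENTS = 5000  # illustrative value
FLAT_SHIPPING_FEE_CENTS = 500         # illustrative value


def shipping_fee(subtotal_cents):
    if subtotal_cents < FREE_SHIPPING_THRESHOLD_CENTS:
        return FLAT_SHIPPING_FEE_CENTS
    return 0


def order_total(item_cents):
    subtotal = sum(item_cents)
    return subtotal + shipping_fee(subtotal)
```

Both versions compute the same totals; the second one needs no comment because `shipping_fee` and the two constants say what the comment used to say.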

0:19:56.810,0:19:59.040
so it's pretty easy to convince people

0:19:59.040,0:20:00.430
that comments are a design smell

0:20:00.430,0:20:02.010
and I think a lot of people in the industry

0:20:02.010,0:20:03.490
are starting to agree

0:20:03.490,0:20:05.280
maybe not for like a public library

0:20:05.280,0:20:06.800
where you really need to tell someone

0:20:06.800,0:20:09.660
here's how you use this class and this is[br]what it's for

0:20:09.660,0:20:12.470
but you shouldn't have to document every method

0:20:12.470,0:20:15.440
and every argument because the method name[br]and the argument name

0:20:15.440,0:20:18.170
should speak for themselves, right

0:20:18.170,0:20:20.670
so here's one that you probably won't agree[br]with

0:20:20.670,0:20:21.940
tests are a design smell

0:20:21.940,0:20:28.580
so this one is probably a little more controversial

0:20:28.580,0:20:32.570
especially in an environment where you're[br]maybe still struggling people

0:20:32.570,0:20:37.910
struggling with people to actually get them[br]to write tests to begin with, right

0:20:37.910,0:20:41.330
you know I went through this period in, like,[br]2000 and 2001

0:20:41.330,0:20:44.090
where I was really heavily into evangelizing[br]TDD

0:20:44.090,0:20:47.190
and it was really stressful that you couldn't[br]get anyone to do it

0:20:47.190,0:20:49.520
I think you do have to go through that period

0:20:49.520,0:20:52.110
and I'm not saying you shouldn't write any[br]tests

0:20:52.110,0:20:57.180
but that picture I showed you earlier of the[br]slow, brittle test suite

0:20:57.180,0:20:58.300
that's bad, right

0:20:58.300,0:21:00.910
that is a bad state to be in

0:21:00.910,0:21:03.890
and you're in that state because your tests[br]suck

0:21:03.890,0:21:05.850
that's why you get in that state

0:21:05.850,0:21:09.570
your tests suck because you're writing bad[br]tests

0:21:09.570,0:21:15.910
that don't exercise the right things in your[br]system

0:21:15.910,0:21:18.960
and what I've found is whenever I look into[br]one of these

0:21:18.960,0:21:21.809
big slow brittle test suites

0:21:21.809,0:21:25.180
the tests themselves are indications

0:21:25.180,0:21:28.240
and the sheer proliferation of tests

0:21:28.240,0:21:30.940
are indications that the system is bad

0:21:30.940,0:21:33.590
and the developers are like desperately

0:21:33.590,0:21:36.980
fearfully trying to run the code

0:21:36.980,0:21:38.650
in every way they can

0:21:38.650,0:21:40.660
because it's the only way they can manage

0:21:40.660,0:21:43.980
to even think about the complexity

0:21:43.980,0:21:47.720
but if you think about it, if you had a tiny[br]trivial system

0:21:47.720,0:21:50.059
you wouldn't need to have hundreds of test[br]files

0:21:50.059,0:21:53.059
that take ten minutes to run, ever

0:21:53.059,0:21:54.480
if you did, you're doing something stupid

0:21:54.480,0:21:57.020
you're wasting your time working on tests

0:21:57.020,0:22:00.110
and we as software developers obsess about[br]this kind of thing

0:22:00.110,0:22:04.770
because we have to fight so hard to get our[br]peers to do it in the first place

0:22:04.770,0:22:06.370
and to understand it

0:22:06.370,0:22:10.050
we obsess to the point where we focus on the[br]wrong thing

0:22:10.050,0:22:14.660
none of us are in the business of writing[br]tests for customers

0:22:14.660,0:22:17.559
like we're not launching our tests on the[br]web

0:22:17.559,0:22:20.000
and hoping people will buy them, right

0:22:20.000,0:22:23.900
it doesn't provide value, it's just a side-effect

0:22:23.900,0:22:25.780
that we have focused too heavily on

0:22:25.780,0:22:29.930
and we've lost sight of what the actual goal[br]is

0:22:29.930,0:22:34.210
so, this one actually requires a visual

0:22:34.210,0:22:37.100
I tell the people on my team now

0:22:37.100,0:22:40.340
you can write code in any language you want

0:22:40.340,0:22:42.760
any framework you want, anything you want[br]to do

0:22:42.760,0:22:44.559
as long as the code is this big

0:22:44.559,0:22:47.490
so if you want to write the new service in[br]Haskell

0:22:47.490,0:22:50.059
and it's this big in a normal size font

0:22:50.059,0:22:51.470
you can do it

0:22:51.470,0:22:54.260
if you want to do it in Clojure or Elixir[br]or Scala or Ruby

0:22:54.260,0:22:55.050
or whatever you want to do

0:22:55.050,0:22:56.820
even Python for god's sake

0:22:56.820,0:22:59.230
you can do it if it's this big and no bigger

0:22:59.230,0:23:04.010
why? because it means I can look at it

0:23:04.010,0:23:05.620
and I can understand it

0:23:05.620,0:23:08.730
or if I don't I'll just throw it away

0:23:08.730,0:23:12.100
because if it's this big it doesn't do very[br]much, right

0:23:12.100,0:23:14.450
so the risk is really low

0:23:14.450,0:23:16.809
and I really mean the system is that

0:23:16.809,0:23:19.130
there are the, the component is that big

0:23:19.130,0:23:21.070
and in my world a component means a service

0:23:21.070,0:23:24.710
that's running and probably listening on an[br]HTTP port

0:23:24.710,0:23:27.820
or some sort of Thrift or RPC protocol

0:23:27.820,0:23:29.520
so it's a standalone thing

0:23:29.520,0:23:30.680
it's its own application

0:23:30.680,0:23:33.130
it's probably in its own git repository

0:23:33.130,0:23:34.950
people do pull requests against it

0:23:34.950,0:23:35.820
but it's just tiny

0:23:35.820,0:23:39.110
so this big

0:23:39.110,0:23:41.200
at the top of this, by the way

0:23:41.200,0:23:45.720
is some code by Konstantin Haase

0:23:45.720,0:23:48.720
who also lives in Berlin, where I live

0:23:48.720,0:23:51.480
this is a rewrite of Sinatra

0:23:51.480,0:23:52.430
the web framework

0:23:52.430,0:23:55.450
and Konstantin is actually the maintainer[br]of Sinatra

0:23:55.450,0:23:58.870
it's not fully compatible, but it's amazingly[br]close

0:23:58.870,0:24:00.260
and it all fits right in that

0:24:00.260,0:24:05.020
but the font size is kind of small, so I cheated

0:24:05.020,0:24:08.550
another rule, our systems are heterogeneous[br]by default

0:24:08.550,0:24:11.420
so I say you can write in any language you[br]want

0:24:11.420,0:24:14.050
that's not just because I want the developers[br]to be excited

0:24:14.050,0:24:16.650
although I think, most of you, if you worked

0:24:16.650,0:24:19.390
in an environment where your boss told you

0:24:19.390,0:24:21.500
you can use any programming language or tool[br]you want

0:24:21.500,0:24:23.809
you would be pretty happy about that, right

0:24:23.809,0:24:26.590
anyone unhappy about that? I don't think so

0:24:26.590,0:24:28.100
unless it's one of the bosses here

0:24:28.100,0:24:31.679
that's like don't tell people that

0:24:31.679,0:24:32.570
so that's one thing

0:24:32.570,0:24:36.880
the other one is, it leads to a good system[br]design

0:24:36.880,0:24:38.840
because think about this

0:24:38.840,0:24:42.350
if I write one program in Erlang, one component[br]in Erlang

0:24:42.350,0:24:44.410
one program in Ruby

0:24:44.410,0:24:47.710
I have to work really, really hard to make[br]tight coupling

0:24:47.710,0:24:49.650
between those things

0:24:49.650,0:24:53.340
like I have to basically use computer science[br]to do that

0:24:53.340,0:24:54.370
I don't even know what I would do

0:24:54.370,0:24:55.929
you know it's hard

0:24:55.929,0:24:58.590
like I would have to maybe implement Ruby[br]in Erlang

0:24:58.590,0:25:01.140
so that it can run in the same VM or vice[br]versa

0:25:01.140,0:25:04.059
it's just silly, I wouldn't do it

0:25:04.059,0:25:07.050
so if my system is heterogeneous by default

0:25:07.050,0:25:11.960
my coupling is very low, at least at a certain[br]level by default

0:25:11.960,0:25:14.170
because it's the path of least resistance

0:25:14.170,0:25:16.679
is to make the system decoupled

0:25:16.679,0:25:19.300
it's easier to make things decoupled than[br]coupled

0:25:19.300,0:25:21.510
if they're all running in different languages

0:25:21.510,0:25:25.210
so in the past three months, I'll say

0:25:25.210,0:25:30.490
I have written production code in Objective-C,[br]Ruby, Scala, Clojure, Node

0:25:30.490,0:25:34.059
I don't know, more stuff, Java

0:25:34.059,0:25:35.670
all these different languages

0:25:35.670,0:25:38.809
real code for work

0:25:38.809,0:25:40.550
and yes, they are not tightly coupled

0:25:40.550,0:25:44.650
like I haven't installed JRuby so that I could[br]reach into the internals of my Scala code

0:25:44.650,0:25:45.630
because that would be a pain

0:25:45.630,0:25:50.730
I don't want to do that

0:25:50.730,0:25:52.960
another very important one is

0:25:52.960,0:25:55.559
server nodes are disposable

0:25:55.559,0:25:59.429
so, back when I was at GE, for example

0:25:59.429,0:26:02.730
I remember being really proud when I looked[br]at the uptime of one of my servers

0:26:02.730,0:26:05.480
and it was like four hundred days or something

0:26:05.480,0:26:07.150
it's like, wow, this is awesome

0:26:07.150,0:26:09.750
I have this big server, it had all these apps[br]on it

0:26:09.750,0:26:12.940
we kept it running for four hundred days

0:26:12.940,0:26:14.809
the problem with that is I was afraid to ever[br]touch it

0:26:14.809,0:26:17.510
I was really happy it was alive

0:26:17.510,0:26:18.860
but I didn't want to do anything to it

0:26:18.860,0:26:21.250
I was afraid to update the operating system

0:26:21.250,0:26:23.770
in fact you could not upgrade Solaris then[br]without restarting it

0:26:23.770,0:26:27.540
so that meant I had not been upgrading the operating[br]system

0:26:27.540,0:26:32.390
I probably shouldn't have been too proud about[br]it

0:26:32.390,0:26:34.890
Nodes that are alive for a long time lead[br]to fear

0:26:34.890,0:26:37.440
and what I want is less fear

0:26:37.440,0:26:39.340
so I throw them away

0:26:39.340,0:26:42.900
and this means I don't have physical servers[br]that I throw away

0:26:42.900,0:26:45.920
that would be fun but I'm not that rich yet

0:26:45.920,0:26:49.160
we use AWS right now, you could do it with[br]any kind of cloud service

0:26:49.160,0:26:52.640
or even an internal cloud provider

0:26:52.640,0:26:53.780
but every node is disposable

0:26:53.780,0:27:00.550
so, we never upgrade software on an existing[br]server

0:27:00.550,0:27:03.150
whenever you want to deploy a new version[br]of a service

0:27:03.150,0:27:04.370
you create new servers

0:27:04.370,0:27:05.429
and you deploy that version

0:27:05.429,0:27:08.790
and then you replace them in the load balancer[br]or somewhere

0:27:08.790,0:27:10.200
that's it
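
The replace-rather-than-upgrade flow described above can be sketched as follows. This is a minimal in-memory illustration, not the speaker's actual tooling: `Node`, `LoadBalancer`, `boot_nodes`, and `deploy` are hypothetical names, and booting from a golden image is simulated by creating objects.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical source of unique node ids


@dataclass(frozen=True)
class Node:
    node_id: int
    version: str  # fixed at boot time; frozen, so it can never be upgraded in place


@dataclass
class LoadBalancer:
    pool: list = field(default_factory=list)


def boot_nodes(version, how_many):
    """Stand-in for booting fresh servers from a golden master image."""
    return [Node(next(_ids), version) for _ in range(how_many)]


def deploy(lb, version, how_many):
    """Boot new nodes, swap them into the pool, and return the retired nodes."""
    new_nodes = boot_nodes(version, how_many)
    old_nodes, lb.pool = lb.pool, new_nodes
    return old_nodes  # the caller destroys these; nothing is patched


lb = LoadBalancer()
deploy(lb, "v1", 3)
retired = deploy(lb, "v2", 3)
print([n.version for n in lb.pool], "retired:", len(retired))
```

Because a node's version is fixed when it is created, the only way to ship a change in this sketch is to build new nodes, which is the property the talk is after.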

0:27:10.200,0:27:13.100
so, you never have to wonder what's on a server

0:27:13.100,0:27:15.620
because it was deployed through an automated[br]process

0:27:15.620,0:27:16.840
and there's no fear there

0:27:16.840,0:27:17.980
you know exactly what it is

0:27:17.980,0:27:19.320
you know exactly how to recreate it

0:27:19.320,0:27:21.540
because you have a golden master image

0:27:21.540,0:27:24.200
and in our case it's actually an Amazon image

0:27:24.200,0:27:26.380
that you can just boot more of

0:27:26.380,0:27:27.440
scaling is a problem

0:27:27.440,0:27:29.070
you just boot ten more servers

0:27:29.070,0:27:32.520
boom, done, no problem

0:27:32.520,0:27:35.450
so yeah I tell the team, you know, pick your[br]technology

0:27:35.450,0:27:38.090
everything must be automated, that's another[br]piece

0:27:38.090,0:27:43.059
if you're going to deploy a Clojure service[br]for the first time

0:27:43.059,0:27:46.760
you have to be responsible for figuring out[br]how it fits into our deployment system

0:27:46.760,0:27:50.309
so that you have immutable deployments and[br]disposable nodes

0:27:50.309,0:27:53.760
if you can do that and you're willing to also[br]maintain it and teach someone else

0:27:53.760,0:27:55.910
about the little piece of code that you wrote,[br]then cool

0:27:55.910,0:27:59.010
you can do it, any level you want

0:27:59.010,0:28:02.929
and then once you deploy stuff

0:28:02.929,0:28:05.250
like a lot of us like to just SSH into the machines

0:28:05.250,0:28:07.679
and then twiddle with things and replace files

0:28:07.679,0:28:11.660
and like try like fixing bugs live on production

0:28:11.660,0:28:13.990
why not just throw away the actual keys

0:28:13.990,0:28:16.590
because you're going to throw away the system[br]eventually

0:28:16.590,0:28:19.140
you don't even need root access to it

0:28:19.140,0:28:21.490
you don't need to be able to get to it

0:28:21.490,0:28:24.980
except through the port that your service[br]is listening on

0:28:24.980,0:28:26.840
so you can't screw it up

0:28:26.840,0:28:29.470
you can't introduce entropy and mess things[br]up

0:28:29.470,0:28:31.470
if you throw away the keys

0:28:31.470,0:28:33.640
so this is actually a practice that you can[br]do

0:28:33.640,0:28:36.460
deploy the servers, remove all the credentials

0:28:36.460,0:28:39.299
for logging in and the only option you have

0:28:39.299,0:28:43.610
is to destroy them when you're done with them

0:28:43.610,0:28:45.140
provisioning new services in our world

0:28:45.140,0:28:46.960
must also be trivial

0:28:46.960,0:28:51.370
so we have actually now thrown away our Chef[br]repository

0:28:51.370,0:28:54.340
because Chef is obsolete and

0:28:54.340,0:28:56.049
we have replaced it with shell scripts

0:28:56.049,0:29:01.340
and that sounds like I'm an idiot

0:29:01.340,0:29:04.460
I know, but when I say Chef is obsolete

0:29:04.460,0:29:05.480
I don't really mean that

0:29:05.480,0:29:07.100
I like to say that so that people will think

0:29:07.100,0:29:08.450
because a lot of you are probably thinking

0:29:08.450,0:29:11.040
we should move to Chef

0:29:11.040,0:29:11.809
that would be great

0:29:11.809,0:29:13.530
because what you have is a bunch of servers

0:29:13.530,0:29:14.670
that are running for a long time

0:29:14.670,0:29:17.110
and you need to be able to continue to keep[br]them up to date

0:29:17.110,0:29:19.150
Chef is really great at that

0:29:19.150,0:29:22.059
Chef is also good at booting a new server

0:29:22.059,0:29:24.340
but really it's just overkill for that

0:29:24.340,0:29:25.059
yeah

0:29:25.059,0:29:26.460
so if you're always throwing stuff away

0:29:26.460,0:29:27.809
I don't think you need Chef

0:29:27.809,0:29:29.160
do something really, really simple

0:29:29.160,0:29:29.950
and that's what we've done

0:29:29.950,0:29:33.090
so like whenever we deploy a new type of service

0:29:33.090,0:29:37.730
I set up ZooKeeper recently, which is a complete[br]change from the other stuff we're deploying

0:29:37.730,0:29:39.980
I think it was a five line shell script to[br]do that

0:29:39.980,0:29:42.590
I just added it to a git repo and ran a command

0:29:42.590,0:29:47.340
I've got a cluster of ZooKeeper servers running

0:29:47.340,0:29:51.260
you want to always be deploying your software

0:29:51.260,0:29:55.570
this is something I learned from Kent Beck[br]early on in the agile extreme programming

0:29:55.570,0:29:56.330
world

0:29:56.330,0:29:57.980
that if something is hard

0:29:57.980,0:30:00.420
or you perceive it to be hard or difficult

0:30:00.420,0:30:02.290
the best thing you can do

0:30:02.290,0:30:04.390
if you have to do that thing all the time

0:30:04.390,0:30:07.000
is to just do it constantly

0:30:07.000,0:30:09.090
non-stop all the time

0:30:09.090,0:30:10.910
so like deploying in our old world

0:30:10.910,0:30:15.280
where it would take all night once a week

0:30:15.280,0:30:18.040
if we instituted a new policy

0:30:18.040,0:30:19.270
in that team that said

0:30:19.270,0:30:23.100
any change that goes to master must be deployed[br]within five minutes

0:30:23.100,0:30:28.410
I guarantee you we would have fixed that process,[br]right

0:30:28.410,0:30:29.730
and if you're deploying constantly

0:30:29.730,0:30:31.080
all day every day

0:30:31.080,0:30:33.120
you're never going to be afraid of deployments

0:30:33.120,0:30:36.020
because it's always a small change

0:30:36.020,0:30:37.929
so always be deploying

0:30:37.929,0:30:40.410
every new deploy means you're throwing away[br]old servers

0:30:40.410,0:30:42.600
and replacing them with new ones

0:30:42.600,0:30:45.610
in our world I would say that the average[br]uptime

0:30:45.610,0:30:48.240
of one of our servers is probably something[br]like

0:30:48.240,0:30:55.179
seventeen hours and that's because we don't[br]tend to work on the weekend very much

0:30:55.179,0:30:56.870
you also, when you have these sorts of systems

0:30:56.870,0:30:58.710
that are distributed like this

0:30:58.710,0:31:02.100
and you're trying to reduce the fear of change

0:31:02.100,0:31:04.350
the big thing that you're afraid of is failure

0:31:04.350,0:31:06.110
you're afraid that the service is going to[br]fail

0:31:06.110,0:31:07.110
the system is going to go down

0:31:07.110,0:31:10.070
one component won't be reachable, that sort[br]of thing

0:31:10.070,0:31:12.370
so you just have to assume that that's going[br]to happen

0:31:12.370,0:31:17.210
you are not going to build a system that never[br]fails, ever

0:31:17.210,0:31:19.740
I hope you don't, because you will have wasted[br]much of your life

0:31:19.740,0:31:21.100
trying to get that to happen

0:31:21.100,0:31:24.309
instead, assume that the thing, the components[br]are going to fail

0:31:24.309,0:31:25.960
and build resiliency in

0:31:25.960,0:31:28.030
I have a picture here of Joe Armstrong

0:31:28.030,0:31:30.380
who is one of the inventors of Erlang

0:31:30.380,0:31:34.890
if you have not studied Erlang philosophy[br]around failure and recovery

0:31:34.890,0:31:35.340
you should

0:31:35.340,0:31:36.470
and it won't take you long

0:31:36.470,0:31:39.070
so I'm just going to leave that as homework[br]for you

0:31:39.070,0:31:42.110
and then, you know, I said, the tests are[br]a design smell

0:31:42.110,0:31:43.540
I don't mean don't write any tests

0:31:43.540,0:31:45.950
but I also want to be further responsible[br]here

0:31:45.950,0:31:50.540
and say you should monitor everything

0:31:50.540,0:31:52.880
you want to favor measurement over testing

0:31:52.880,0:31:57.130
so I use measurement as a surrogate for testing

0:31:57.130,0:31:57.850
or as an enhancement

0:31:57.850,0:32:03.980
and the reason I say this is

0:32:03.980,0:32:05.650
you can either focus on one of two things

0:32:05.650,0:32:07.790
I said assume failure right, so

0:32:07.790,0:32:12.370
mean time between failures or mean time to[br]resolution

0:32:12.370,0:32:16.200
those are kind of two metrics in the ops world

0:32:16.200,0:32:17.400
that people talk about

0:32:17.400,0:32:20.140
for measuring their success and their effectiveness

0:32:20.140,0:32:21.980
mean time between failures means

0:32:21.980,0:32:25.360
you're trying to increase the time between[br]failures

0:32:25.360,0:32:29.290
of the system, so basically you're trying[br]to make failures never happen, right

0:32:29.290,0:32:31.059
mean time to resolution means

0:32:31.059,0:32:34.679
when they happen, I'm gonna focus on bringing[br]them back

0:32:34.679,0:32:37.290
as fast as I possibly can

0:32:37.290,0:32:41.120
so a perfect example would be a system fails

0:32:41.120,0:32:43.720
and another one is already up and just takes[br]over its work

0:32:43.720,0:32:46.679
mean time to resolution is essentially zero,[br]right

0:32:46.679,0:32:50.679
if you're always assuming that every component[br]can and will fail

0:32:50.679,0:32:53.770
then mean time to resolution is going to be[br]really good

0:32:53.770,0:32:56.240
because you're going to bake it into the process

0:32:56.240,0:32:59.480
if you do that, you don't care about when[br]things fail
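
The trade-off between the two metrics can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The formula itself is not mentioned in the talk, and the numbers below are made up purely for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but takes ten hours to bring back.
rare_but_slow = availability(mtbf_hours=1000.0, mttr_hours=10.0)

# Hypothetical system B: fails 100 times as often, but a standby takes over
# in about 36 seconds (0.01 hours).
frequent_but_fast = availability(mtbf_hours=10.0, mttr_hours=0.01)

print(f"rare but slow:     {rare_but_slow:.4f}")
print(f"frequent but fast: {frequent_but_fast:.4f}")
```

Under these assumed numbers the system that fails far more often still ends up with the higher availability, because recovery time dominates the result.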

0:32:59.480,0:33:02.640
and back to this idea of favoring measurement[br]over testing

0:33:02.640,0:33:07.250
if you're monitoring everything, everything[br]with intelligence

0:33:07.250,0:33:10.390
then you're actually focusing on mean time[br]to resolution

0:33:10.390,0:33:15.750
and acknowledging that the software is going[br]to be broken sometimes, right

0:33:15.750,0:33:18.200
and when I say monitor everything, I mean[br]everything

0:33:18.200,0:33:21.940
I don't mean, like your disk space and your[br]memory and stuff there

0:33:21.940,0:33:23.669
I'm talking about business metrics

0:33:23.669,0:33:27.630
so, at LivingSocial we created this thing[br]called rearview

0:33:27.630,0:33:29.250
which is now open source

0:33:29.250,0:33:33.030
which allows you to do aberration detection

0:33:33.030,0:33:37.919
and aberration means strange behavior, strange[br]change in behavior

0:33:37.919,0:33:41.679
so rearview can do aberration detection

0:33:41.679,0:33:44.690
on data sets, arbitrary data sets

0:33:44.690,0:33:47.010
which means, like in the LivingSocial world

0:33:47.010,0:33:48.230
we had user sign ups

0:33:48.230,0:33:49.190
constantly streaming in

0:33:49.190,0:33:51.559
it was a very high volume site

0:33:51.559,0:33:53.799
if user sign-ups were weird

0:33:53.799,0:33:55.940
we would get an alert

0:33:55.940,0:33:57.540
why might they be weird?

0:33:57.540,0:34:00.830
one thing could be like the user service is[br]down, right

0:34:00.830,0:34:02.100
so then we would get two alerts

0:34:02.100,0:34:04.010
user sign ups have gone down

0:34:04.010,0:34:05.150
and so has the service

0:34:05.150,0:34:07.510
so obviously the problem is the service is[br]down

0:34:07.510,0:34:09.679
let's bring it back up

0:34:09.679,0:34:11.409
but it could be something like

0:34:11.409,0:34:13.349
a front-end developer or a designer

0:34:13.349,0:34:16.469
made a change that was intentional

0:34:16.469,0:34:18.040
but it just didn't work and no one liked it

0:34:18.040,0:34:21.168
so they didn't sign up to the site anymore

0:34:21.168,0:34:23.980
that's more important than just knowing that[br]the service is down

0:34:23.980,0:34:25.460
right, because what you care about

0:34:25.460,0:34:27.190
isn't that the service is up or down

0:34:27.190,0:34:30.540
if you could crash the entire system and still[br]be making money

0:34:30.540,0:34:31.859
you don't care, right, that's better

0:34:31.859,0:34:34.839
throw it away and stop paying for the servers

0:34:34.839,0:34:40.679
but if your system is up 100% of the time[br]and performs excellently

0:34:40.679,0:34:43.359
but no one's using it, that's bad

0:34:43.359,0:34:49.279
so monitoring business metrics gives you a[br]lot more than unit tests could ever give you
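
As a rough illustration of alerting on a business metric such as sign-ups, here is a minimal sketch that flags a value lying far from the recent average. It is not rearview's algorithm: the z-score rule, the threshold, the function name `is_aberrant`, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev


def is_aberrant(history, latest, threshold=3.0):
    """True if `latest` is more than `threshold` sample standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical sign-ups per minute over the last eight minutes.
signups_per_minute = [120, 118, 125, 121, 119, 123, 122, 120]

print(is_aberrant(signups_per_minute, 122))  # an ordinary minute
print(is_aberrant(signups_per_minute, 40))   # sign-ups collapse
```

The check does not care why sign-ups dropped. A crashed service and an unpopular design change both trip it, which is the point made in the talk.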

0:34:49.279,0:34:50.899
and then in our world

0:34:50.899,0:34:51.859
we focused on experiencing

0:34:51.859,0:34:56.259
no, you have to come up to front and say ten!

0:34:56.259,0:34:59.220
ok, ten minutes left

0:34:59.220,0:35:01.989
when I got to 6WunderKinder in Berlin

0:35:01.989,0:35:04.069
everyone was terrified to touch the system

0:35:04.069,0:35:08.710
because they had created a really well-designed

0:35:08.710,0:35:12.009
but traditional monolithic API

0:35:12.009,0:35:13.539
so they had layers of abstractions

0:35:13.539,0:35:15.289
it was all kind of in one big thing

0:35:15.289,0:35:16.519
they had a huge database

0:35:16.519,0:35:19.720
and they were really, really scared to do[br]anything

0:35:19.720,0:35:22.190
so there's like one person who would deploy[br]anything

0:35:22.190,0:35:24.190
and everyone else was trying to work on other[br]projects

0:35:24.190,0:35:25.950
and not touch it

0:35:25.950,0:35:27.859
but it was like the production system

0:35:27.859,0:35:29.960
you know so it wasn't really an option

0:35:29.960,0:35:31.880
so the first thing I did in my first week

0:35:31.880,0:35:34.920
is I got these graphs going

0:35:34.920,0:35:39.239
and this was, yeah, response time

0:35:39.239,0:35:42.749
and the first thing I did is I started turning[br]off servers

0:35:42.749,0:35:44.279
and just watching the graphs

0:35:44.279,0:35:47.749
and then, as I was turning off the servers

0:35:47.749,0:35:49.380
I went to the production database

0:35:49.380,0:35:54.220
and I did select, count, star from tasks

0:35:54.220,0:35:55.650
and we're a task management app

0:35:55.650,0:35:58.249
so we have hundreds of millions of tasks

0:35:58.249,0:36:00.910
and the whole thing crashed

0:36:00.910,0:36:04.119
and all the people were like AAAAH what's[br]going on

0:36:04.119,0:36:05.630
you know, and I said, it's no problem

0:36:05.630,0:36:08.539
I did this on purpose, I'll just make it come[br]back

0:36:08.539,0:36:10.119
which I did

0:36:10.119,0:36:11.079
and from that point on

0:36:11.079,0:36:13.349
like, really every day I would do something

0:36:13.349,0:36:16.999
which would basically crash the system for just[br]a moment

0:36:16.999,0:36:19.819
and really, like, we had way too many servers[br]in production

0:36:19.819,0:36:22.690
we were spending tens of thousands more Euros[br]per month

0:36:22.690,0:36:25.079
than we should have on the infrastructure

0:36:25.079,0:36:27.499
and I just started taking things away

0:36:27.499,0:36:28.819
and I would usually do it

0:36:28.819,0:36:30.579
instead of the responsible way,

0:36:30.579,0:36:31.630
like one server at a time

0:36:31.630,0:36:34.079
I would just remove all of them and start[br]adding them back

0:36:34.079,0:36:36.220
so for a moment everything was down

0:36:36.220,0:36:38.809
but after that we got to a point where

0:36:38.809,0:36:41.299
everyone on the team was absolutely comfortable

0:36:41.299,0:36:42.720
with the worst case scenario

0:36:42.720,0:36:45.180
of the system being completely down

0:36:45.180,0:36:47.989
so that we could, in a panic free way

0:36:47.989,0:36:51.059
just focus on bringing it up when it was bad

0:36:51.059,0:36:52.940
so now when you do a deployment

0:36:52.940,0:36:54.710
and you have your business metrics being measured

0:36:54.710,0:36:57.160
you know the important stuff is happening

0:36:57.160,0:37:00.559
and you know what to do when everything is[br]down

0:37:00.559,0:37:02.509
you've experienced the worst thing that can[br]happen

0:37:02.509,0:37:04.690
well the worst thing is like someone breaks[br]in

0:37:04.690,0:37:07.789
and steals all your stuff, steals all your[br]users' phone numbers

0:37:07.789,0:37:10.140
and posts them online like Snapchat or something

0:37:10.140,0:37:13.650
but you've experienced all these potentially[br]horrible things

0:37:13.650,0:37:16.920
and realized, eh, it's not so bad, I can deal[br]with this

0:37:16.920,0:37:19.119
I know what to do

0:37:19.119,0:37:22.400
it allows you to start making bold moves

0:37:22.400,0:37:23.640
and that's what we all want right

0:37:23.640,0:37:28.739
we all want to be able to bravely go into[br]our systems

0:37:28.739,0:37:30.319
and do anything we think is right

0:37:30.319,0:37:33.869
so that's what I've been focusing on

0:37:33.869,0:37:36.769
we also do this thing called Canary in the[br]Coal Mine deployments

0:37:36.769,0:37:38.999
which removes the fear, also

0:37:38.999,0:37:43.479
canary in the coalmine refers to a kind of[br]sad thing

0:37:43.479,0:37:46.869
about coal miners in the US

0:37:46.869,0:37:49.400
where they would send canaries into the mines

0:37:49.400,0:37:50.380
at various levels

0:37:50.380,0:37:54.170
and if the canary died they knew there was[br]a problem

0:37:54.170,0:37:58.299
with the air

0:37:58.299,0:37:59.470
but in the software world

0:37:59.470,0:38:02.839
what this means is you have a bunch of servers[br]running

0:38:02.839,0:38:06.400
or a bunch of, I don't know, clients running[br]a certain version

0:38:06.400,0:38:09.789
and you start introducing a new version incrementally

0:38:09.789,0:38:11.769
and watching the effects

0:38:11.769,0:38:13.210
so once you're measuring everything

0:38:13.210,0:38:14.680
and monitoring everything

0:38:14.680,0:38:17.039
you can also start doing these canary in the[br]coalmine things

0:38:17.039,0:38:19.069
where you say OK I have a new version of this[br]service

0:38:19.069,0:38:20.369
that I'm going to deploy

0:38:20.369,0:38:22.769
and I've got thirty servers running for it

0:38:22.769,0:38:25.869
but I'm going to change only five of them[br]now

0:38:25.869,0:38:27.950
and see, like, does my error rate increase

0:38:27.950,0:38:30.180
or does my performance drop on those servers

0:38:30.180,0:38:33.880
or do people actually not successfully complete[br]the task they're trying to do

0:38:33.880,0:38:34.650
on those servers

0:38:34.650,0:38:39.989
so, this also allows us the combination of[br]monitoring everything

0:38:39.989,0:38:41.989
and these immutable deployments and everything

0:38:41.989,0:38:46.569
gives us the ability to gradually effect change[br]and not be afraid

0:38:46.569,0:38:48.460
so we roll out changes all day every day

0:38:48.460,0:38:53.819
because we don't fear that we're just going[br]to destroy the entire system all at once
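
The canary check described above, changing five of thirty servers and comparing their behavior with the rest, can be sketched like this. The `Stats` type, the `canary_verdict` function, the tolerance, and the request counts are hypothetical. A real rollout would also compare performance and task completion, as the talk mentions.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, max_increase=0.01):
    """Return 'rollback' if the canary's error rate exceeds the baseline's
    by more than `max_increase`, otherwise 'promote'."""
    if canary.error_rate - baseline.error_rate > max_increase:
        return "rollback"
    return "promote"


# Hypothetical numbers: 25 servers on the old version, 5 on the new one.
old_version = Stats(requests=25000, errors=50)
healthy_canary = Stats(requests=5000, errors=12)
broken_canary = Stats(requests=5000, errors=400)

print(canary_verdict(old_version, healthy_canary))
print(canary_verdict(old_version, broken_canary))
```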

0:38:53.819,0:38:55.880
so I think I have like five minutes left

0:38:55.880,0:39:00.239
uh, these are some things we're not necessarily[br]doing yet

0:39:00.239,0:39:01.970
but they're some ideas that I have

0:39:01.970,0:39:04.940
that given some free time I will work on

0:39:04.940,0:39:08.700
and, they're probably more exciting

0:39:08.700,0:39:11.319
one is I talked about homeostatic regulation

0:39:11.319,0:39:13.579
and homeostasis

0:39:13.579,0:39:16.640
so I think we all understand the idea of you[br]know homeostasis

0:39:16.640,0:39:20.019
and the fact that systems have different parts[br]that do different roles

0:39:20.019,0:39:21.819
and can protect each other from each other

0:39:21.819,0:39:27.819
but, so this diagram is actually just some[br]random diagram

0:39:27.819,0:39:30.789
I copied and pasted off the AWS website

0:39:30.789,0:39:33.589
so it's not necessarily all that meaningful

0:39:33.589,0:39:36.279
except to show that every architecture

0:39:36.279,0:39:38.680
especially server based architectures

0:39:38.680,0:39:42.979
has a collection of services that play different[br]roles

0:39:42.979,0:39:45.079
and it almost looks like a person

0:39:45.079,0:39:46.989
you've got a brain and a heart and a liver

0:39:46.989,0:39:50.690
and all these things, right

0:39:50.690,0:39:53.009
what would it mean to actually implement

0:39:53.009,0:39:56.539
homeostatic regulation in a web service?

0:39:56.539,0:39:59.539
so that you have some controlling system

0:39:59.539,0:40:02.579
where the database will actually kill an app[br]server

0:40:02.579,0:40:04.859
that is hurting it, for example

0:40:04.859,0:40:07.410
just kill it

0:40:07.410,0:40:08.769
I don't know yet, I don't know what that is

0:40:08.769,0:40:14.339
but some ideas about this stuff

0:40:14.339,0:40:15.569
I don't know if you've heard of these

0:40:15.569,0:40:19.690
Netflix, do you have Netflix in India yet?

0:40:19.690,0:40:23.499
probably not, unless you have a VPN, right

0:40:23.499,0:40:27.029
Netflix has a really great cloud based architecture

0:40:27.029,0:40:29.839
they have this thing called Chaos Monkey they've[br]created

0:40:29.839,0:40:33.940
which goes through their system and randomly[br]destroys nodes

0:40:33.940,0:40:35.960
just crashes servers

0:40:35.960,0:40:39.680
and they did this because, when they were,[br]they were early users of AWS

0:40:39.680,0:40:42.239
and when they went out initially with AWS,[br]servers were crashing

0:40:42.239,0:40:43.880
like it was still immature

0:40:43.880,0:40:46.279
so they said OK we still want to use this

0:40:46.279,0:40:49.769
and we'll build in stuff so that we can deal[br]with the crashes

0:40:49.769,0:40:52.210
but we have to know it's gonna work when it[br]crashes

0:40:52.210,0:40:55.410
so let's make crashing be part of production

0:40:55.410,0:40:58.210
so they actually have gotten really sophisticated[br]now

0:40:58.210,0:41:00.499
and they will crash entire regions

0:41:00.499,0:41:01.819
cause they're in multiple data centers

0:41:01.819,0:41:03.569
so they'll say like, what would happen if[br]this

0:41:03.569,0:41:06.479
data center went down, does the site still[br]stay up?

0:41:06.479,0:41:08.369
and they do this in production all the time

0:41:08.369,0:41:09.609
like they're crashing servers right now

0:41:09.609,0:41:11.130
it's really neat

0:41:11.130,0:41:14.079
another one that is inspirational in this[br]way

0:41:14.079,0:41:19.170
is Pinterest, they use AWS as well

0:41:19.170,0:41:22.450
and they have, AWS has this thing called Spot[br]Instances

0:41:22.450,0:41:24.440
and I won't go into too much detail

0:41:24.440,0:41:25.910
because I don't have time

0:41:25.910,0:41:29.869
but Spot Instances allow you to effectively

0:41:29.869,0:41:36.210
bid on servers at a price that you are willing[br]to pay

0:41:36.210,0:41:39.559
so like if a usual server costs $0.20 per[br]minute

0:41:39.559,0:41:42.319
you can say, I'll give $0.15 per minute

0:41:42.319,0:41:45.380
and when excess capacity comes open

0:41:45.380,0:41:47.710
it's almost like a stock market

0:41:47.710,0:41:50.299
if $0.15 is the going price, you'll get a[br]server

0:41:50.299,0:41:52.479
and it starts up and it runs what you want

0:41:52.479,0:41:54.400
but here's the cool thing

0:41:54.400,0:42:00.140
if the stock market goes and the price goes[br]higher than you're willing to pay

0:42:00.140,0:42:03.229
Amazon will just turn off those servers

0:42:03.229,0:42:05.219
they're just dead, you don't have any warning

0:42:05.219,0:42:06.579
they're just dead
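
The bidding behavior as described in the talk can be modeled in a few lines. This sketch mirrors only that description, not AWS's actual pricing mechanics or API: the function name, the instance ids, and the prices are illustrative assumptions.

```python
def surviving_instances(bids, market_price):
    """Keep instances whose bid is at least the current market price.
    Everything else is terminated without warning."""
    return {iid: bid for iid, bid in bids.items() if bid >= market_price}


# Hypothetical fleet: instance id -> the maximum price we offered to pay.
bids = {"web-1": 0.15, "web-2": 0.15, "worker-1": 0.20}

print(surviving_instances(bids, market_price=0.15))  # everyone keeps running
print(surviving_instances(bids, market_price=0.18))  # the 0.15 bidders are gone
```

Any service running on such a fleet has to tolerate losing an arbitrary subset of its nodes at any moment, which is why the speaker finds the design interesting.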

0:42:06.579,0:42:11.069
so Pinterest uses this for their production[br]servers

0:42:11.069,0:42:13.749
which means they save a lot of money

0:42:13.749,0:42:17.269
they're paying way under the average Amazon[br]cost for hosting

0:42:17.269,0:42:19.309
but the really cool thing in my opinion

0:42:19.309,0:42:21.170
is not the money they save but the fact that

0:42:21.170,0:42:26.039
like, what would you have to do to build a[br]full system

0:42:26.039,0:42:29.119
where any node can and will die at any moment

0:42:29.119,0:42:31.259
and it's not even under your control

0:42:31.259,0:42:33.529
that's really exciting

0:42:33.529,0:42:36.259
so a simple thing you can do for homeostasis[br]though

0:42:36.259,0:42:37.509
is you can just adjust

0:42:37.509,0:42:39.489
so in our world we have multiple nodes

0:42:39.489,0:42:40.969
and all these little services

0:42:40.969,0:42:42.569
we can scale each one independently

0:42:42.569,0:42:44.569
we're measuring everything

0:42:44.569,0:42:46.400
so Amazon has a thing called Auto Scaling

0:42:46.400,0:42:49.469
we don't use it, we do our own scaling

0:42:49.469,0:42:54.119
and we just do it based on volume and performance
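
The talk does not say how the scaling decision is made, so the following is only an illustrative sketch of scaling one service on volume and performance. The function name, the parameters and the "add one instance when latency is over target" rule are all assumptions.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      p95_latency_ms, latency_target_ms, minimum=2):
    """Pick an instance count for one service from volume and performance."""
    # Volume: enough instances to absorb the current request rate.
    count = math.ceil(requests_per_sec / capacity_per_instance)
    # Performance: add headroom when the service is slower than its target.
    if p95_latency_ms > latency_target_ms:
        count += 1
    # Never drop below a floor, so a single dying node is not an outage.
    return max(minimum, count)


print(desired_instances(900, 100, p95_latency_ms=80, latency_target_ms=200))
print(desired_instances(900, 100, p95_latency_ms=250, latency_target_ms=200))
```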

0:42:54.119,0:42:57.869
now when you have a bunch of services like[br]this

0:42:57.869,0:43:00.539
like, I don't know, maybe we have fifty different[br]services now

0:43:00.539,0:43:03.229
that each play tiny little roles

0:43:03.229,0:43:07.210
it becomes difficult to figure out, like,[br]where things are

0:43:07.210,0:43:10.619
so we've started implementing ZooKeeper for[br]service resolution

0:43:10.619,0:43:14.130
which means a service can come online and[br]say

0:43:14.130,0:43:17.539
I'm the reminder service version 2.3

0:43:17.539,0:43:19.349
and then tell a central guardian

0:43:19.349,0:43:21.979
and the ZooKeeper can then route traffic to[br]it
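
The registration step can be sketched with an in-memory stand-in. This is not the ZooKeeper client API; `ServiceRegistry` and its methods are hypothetical names, and the address is made up. It only shows the idea of a service announcing its name and version and callers resolving by that pair.

```python
import random
from collections import defaultdict


class ServiceRegistry:
    """In-memory stand-in for a central registry (hypothetical, not ZooKeeper)."""

    def __init__(self):
        self._instances = defaultdict(list)

    def register(self, name, version, address):
        # A service comes online and announces what it is and where it lives.
        self._instances[(name, version)].append(address)

    def deregister(self, name, version, address):
        # Called when an instance goes away.
        self._instances[(name, version)].remove(address)

    def resolve(self, name, version):
        # Callers ask for a capability, not for a fixed host.
        addresses = self._instances.get((name, version))
        if not addresses:
            raise LookupError(f"no live instance of {name} {version}")
        return random.choice(addresses)


registry = ServiceRegistry()
registry.register("reminder", "2.3", "10.0.0.5:8080")
print(registry.resolve("reminder", "2.3"))
```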

0:43:21.979,0:43:24.019
probably too detailed for now

0:43:24.019,0:43:28.420
I'm gonna skip over some stuff real quick

0:43:28.420,0:43:29.499
but I want to talk about this one

0:43:29.499,0:43:33.739
if, did the Nordic Ruby, no, Nordic Ruby talks[br]never go online

0:43:33.739,0:43:35.160
so you can never see this talk

0:43:35.160,0:43:36.630
sorry

0:43:36.630,0:43:41.499
at Nordic Ruby Reginald Braithwaite did a[br]really cool talk

0:43:41.499,0:43:44.130
on like challenges of the Ruby language

0:43:44.130,0:43:45.380
and he made this statement

0:43:45.380,0:43:48.869
Ruby has beautiful but static coupling

0:43:48.869,0:43:51.269
which was really strange

0:43:51.269,0:43:52.989
but basically he was making the same point[br]that

0:43:52.989,0:43:53.950
I was talking about earlier

0:43:53.950,0:43:59.210
that, like Ruby creates a bunch of ways that[br]you can couple

0:43:59.210,0:44:01.200
your system together

0:44:01.200,0:44:02.729
that kind of screw you in the end

0:44:02.729,0:44:03.960
but they're really beautiful to use

0:44:03.960,0:44:09.819
but, like, Ruby can really lead to some deep[br]crazy coupling

0:44:09.819,0:44:14.089
and so he presented this idea of bind by contract

0:44:14.089,0:44:17.930
and bind by contract, in a Ruby sense

0:44:17.930,0:44:22.539
would be, like, I have a class that has a[br]method

0:44:22.539,0:44:26.410
that takes these parameters under these conditions

0:44:26.410,0:44:29.420
and I can kind of put it into my VM

0:44:29.420,0:44:31.999
and whenever someone needs to have a functionality[br]like that

0:44:31.999,0:44:34.650
it will be automatically bound together

0:44:34.650,0:44:36.589
by the fact that it can do that thing

0:44:36.589,0:44:40.680
and instead of how we tend to use Ruby and[br]Java and other languages

0:44:40.680,0:44:42.910
I have a class with a method name I'm going[br]to call it

0:44:42.910,0:44:45.319
right, that's coupling

0:44:45.319,0:44:48.009
but he proposed this idea of this decoupled[br]system

0:44:48.009,0:44:50.609
where you just say I need a functionality[br]like this

0:44:50.609,0:44:53.390
that works under the conditions that I have[br]present

0:44:53.390,0:44:55.369
so this led me to this idea

0:44:55.369,0:44:59.059
and this may be like way too weird, I don't[br]know

0:44:59.059,0:45:02.569
what if in your web application your routes[br]file

0:45:02.569,0:45:08.130
for your services read like a functional pattern[br]matching syntax

0:45:08.130,0:45:11.200
so like if you've ever used Erlang or Haskell[br]or Scala

0:45:11.200,0:45:14.509
any of these things that have functional pattern[br]matching

0:45:14.509,0:45:18.680
what if you could then route to different[br]services

0:45:18.680,0:45:20.880
across a bunch of different services

0:45:20.880,0:45:23.450
based on contract
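
One way to picture a routes file that reads like pattern matching, written here as an ordered list of predicates where the first match wins. The request shapes and the service names are invented for illustration; the talk does not show an actual syntax.

```python
# Each route pairs a contract (a predicate over the request) with a service.
# Order matters, as in functional pattern matching: the first match wins.
ROUTES = [
    (lambda req: req.get("type") == "reminder" and "due_at" in req,
     "reminder-service"),
    (lambda req: req.get("type") == "task" and req.get("shared") is True,
     "sharing-service"),
    (lambda req: req.get("type") == "task",
     "task-service"),
]


def route(request):
    """Return the first service whose contract the request satisfies."""
    for matches, service in ROUTES:
        if matches(request):
            return service
    raise LookupError("no service satisfies this request")


print(route({"type": "task", "shared": True}))
print(route({"type": "task"}))
```

The caller never names a class or a method; it only describes what it has, and the router binds it to whatever can handle that.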

0:45:23.450,0:45:27.279
now I have zero time left

0:45:27.279,0:45:29.029
but I'm just gonna keep talking, cause I'm[br]mean

0:45:29.029,0:45:30.349
oh wait I'm not allowed to be mean

0:45:30.349,0:45:31.579
because of the code of conduct

0:45:31.579,0:45:34.759
so I'll wrap up

0:45:34.759,0:45:38.749
so this is an idea that I've started working[br]on as well

0:45:38.749,0:45:40.539
where I would actually write an Erlang service

0:45:40.539,0:45:42.700
with this sort of functional pattern matching

0:45:42.700,0:45:45.589
but have it be routing in really fast real[br]time

0:45:45.589,0:45:48.539
through back end services that support it

0:45:48.539,0:45:50.650
one more thing I just want to show you real[br]quick

0:45:50.650,0:45:53.869
that I am working on and I want to show you

0:45:53.869,0:45:57.910
because I want you to help me

0:45:57.910,0:46:00.690
has anyone used JSON Schema?

0:46:00.690,0:46:05.890
OK, you people are my friends for the rest[br]of the conference

0:46:05.890,0:46:08.469
in a system where you have all these things[br]talking to each other

0:46:08.469,0:46:11.219
you do need a way to validate the inputs and[br]outputs

0:46:11.219,0:46:16.229
but I don't want to generate code that parses[br]and creates JSON

0:46:16.229,0:46:21.180
I don't want to do something in real time[br]that intercepts my

0:46:21.180,0:46:24.219
kind of traffic, so there's this thing called[br]JSON Schema

0:46:24.219,0:46:27.219
that allows you to, in a completely decoupled[br]way

0:46:27.219,0:46:30.719
specify JSON documents and how they should[br]interact
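
To make the idea concrete, here is a tiny validator for a small subset of JSON Schema (`type`, `required` and `properties` only), using just the standard library. It is a sketch, not a conforming implementation, and the task schema is an invented example rather than anything shown in the talk.

```python
import json

# Only the handful of JSON Schema types this sketch understands.
TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def _has_type(value, expected):
    if expected == "integer":
        # In Python, bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, TYPES[expected])


def validate(document, schema):
    """Return a list of error messages; an empty list means the document is valid."""
    expected = schema.get("type")
    if expected and not _has_type(document, expected):
        return [f"expected {expected}"]
    errors = []
    if isinstance(document, dict):
        for key in schema.get("required", []):
            if key not in document:
                errors.append(f"missing required property: {key}")
        for key, subschema in schema.get("properties", {}).items():
            if key in document:
                errors.extend(f"{key}: {e}" for e in validate(document[key], subschema))
    return errors


# The schema lives apart from both producer and consumer code.
TASK_SCHEMA = {
    "type": "object",
    "required": ["title"],
    "properties": {"title": {"type": "string"}, "completed": {"type": "boolean"}},
}

print(validate(json.loads('{"title": "Buy milk", "completed": false}'), TASK_SCHEMA))
print(validate(json.loads('{"completed": "no"}'), TASK_SCHEMA))
```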

0:46:30.719,0:46:35.849
and I am working on a new thing that's called[br]Klagen

0:46:35.849,0:46:38.299
which is the German word for complain

0:46:38.299,0:46:42.420
it's written in Scala, so if anyone wants[br]to pair up on some Scala stuff

0:46:42.420,0:46:47.700
what it will be is a high performance asynchronous[br]JSON Schema validation middleware

0:46:47.700,0:46:52.749
so if that's interesting to anyone, even if[br]you don't know Scala or JSON Schema

0:46:52.749,0:46:54.029
please let me know

0:46:54.029,0:46:57.099
and I believe I'm out of time so I'm just[br]gonna end there

0:46:57.099,0:46:58.609
am I right? I'm right, yes

0:46:58.609,0:47:01.529
so thank you very much, and let's talk during[br]the conference