-
-
Chad: Yes, hello, thank you.
-
Audience member: Hello!
-
Chad: Hello!
-
I am Chad, as he said.
-
He said I need no introduction
-
so I won't introduce myself any further.
-
I may be the biggest non-Indian fan of India
-
[Hindi speech]
-
-
I'll now switch back, sorry.
-
If you don't understand Hindi, I said nothing
of value
-
and it was all wrong.
-
But I was saying that my Hindi is bad
-
and it's because now I'm learning German
-
so I mixed them together, but I know not everyone
-
speaks Hindi here.
-
I just had to show off, you know
-
So, I am currently working at 6WunderKinder,
-
and I'm working on a product called Wunderlist.
-
It is a productivity application.
-
It runs on every client you can think of.
-
We have native clients, we have a back-end,
-
we have millions of active users,
-
and I'm telling you this not so that you'll
go download it -
-
you can do that too -
-
but I want to tell you about the challenges
that I have
-
and the way I'm starting to think about systems
architecture and design.
-
That's what I'm gonna talk about today
-
I'm going to show you some things that are
real
-
and that we're really doing.
-
I'm going to show you some things that are
-
just a fantasy that maybe don't make any sense
at all.
-
But hopefully I'll get you thinking about
-
how we think about system architecture
-
and how we build things that can last for
a long time.
-
So the first thing that I want to mention:
-
this is a graph from the Standish Chaos report
-
and I've taken the years out
-
and I've taken some of the raw data out
-
because it doesn't matter.
-
If you look at these, this graph,
-
each one of these bars is a year,
-
and each bar represents successful projects
in green -
-
software projects.
-
Challenged projects are in silver or white
in the middle
-
and then failed ones are in red.
-
But challenged means significantly over time
or budget
-
which to me means failed too.
-
So basically we're terrible,
-
all of us here, we're terrible.
-
We call ourselves engineers but it's a disgrace.
-
We very rarely actually launch things that
work.
-
Kind of sad,
-
and I am here to bring you down.
-
Then once you launch software, anecdotally,
-
and you probably would see this in your own
work lives, too,
-
anecdotally, software gets killed after about
five years -
-
business software.
-
So you barely ever get to launch it, because,
-
or at least successfully, in a way that you're
proud of,
-
and then in about five years
-
you end up in that situation where you're
doing a big rewrite
-
and throwing everything away and replacing
it.
-
You know there's always that project to get
rid of the junk,
-
old Java code or whatever that you wrote five
years ago,
-
replace it with Ruby now,
-
five years from now you'll be replacing your
old junk Ruby code
-
that didn't work with something else.
-
We create this thing, probably all of you
know the term legacy software -
-
Right, am I right? You know what legacy software
is,
-
and you probably think of it as a negative
thing.
-
You think of it as that ugly code that doesn't
work,
-
that's brittle, that you can't change, that
you're all afraid of.
-
But there's actually also a positive connotation
of the word legacy:
-
it's leaving behind something that future
generations can benefit from.
-
But if we're rarely ever launching successful
projects
-
and then the ones we do launch tend to die
within five years
-
none of us are actually creating a legacy
in our work.
-
We're just creating stuff that gets thrown
away.
-
Kind of sad.
-
So we create this stuff that's a legacy software.
-
It's hard to change, that's why it ends up
getting thrown away
-
right, that's, if the software worked
-
and you could keep changing it to meet the
needs of the business
-
you wouldn't need to do a big rewrite and
throw it away.
-
We create these huge tightly-coupled systems,
-
and I don't just mean one application,
-
but like many applications are all tightly
coupled.
-
You've got this thing over here talking to
the database of this system over here
-
so if you change the columns to update the
view of a webpage
-
you ruin your billing system, that kind of
thing
-
this is what makes it so hard to change
-
and the sad thing about this is the way we
work
-
the way we develop software, this is the default
setting
-
and, what I mean is, if we were robots churning
out software
-
and we had a preferences panel
-
the default preferences would lead to us creating
terrible software that gets thrown away in
-
five years
-
that's just how we all work
-
as human beings when we sit down to write
code
-
our default instincts lead us to create
systems that are tightly coupled
-
and hard to change and ultimately get thrown
away and can't scale
-
we create, we try doing tests, we try doing
TDD
-
but we create test suites that take forty-five
minutes to run
-
every team has had to deal with this I'm sure
-
if you've written any kind of meaningful application
-
and it gets to where you have like a project
-
to speed up the test suite
-
like you start focusing your company's resources
-
on making the test suite faster
-
or making it like only fail ninety percent
of the time
-
and then you say well if it only fails ninety
percent that's OK
-
right, and right now it's taking forty-five
minutes
-
we want to get it to where it only takes ten
minutes to run
-
so the test suite ends up being a liability
instead of a benefit
-
because of the way you do it
-
because you have this architecture where everything
is so coupled
-
you can't change anything without spending
hours working on the stupid test suite
-
and you're terrified to deploy
-
I know like the last big Java project I was
working on
-
it would take, once a week we did a deploy
-
it would take fifteen people all night to
deploy the thing
-
and usually it was like copying class files
around
-
and restarting servers
-
it's much better today but it's still terrifying
-
you deploy code, you change it in production
-
you're not sure what might break
-
cause it's really hard to test these big integrated
things together
-
and actually upgrading the technology component
is terrifying
-
so, how many of you have been doing Rails
for more than three years?
-
do you have, like a Rails 2 app in production,
anyone? Yeah?
-
that's a lot of people, wow, that's terrifying
-
and I've been in situations, recently, where
we had Rails 2 apps in production
-
security patches are coming out, we were applying
our own versions
-
of those security patches
-
because we were afraid to upgrade Rails
-
we would rather hack it than upgrade the thing
-
because you just don't know what's gonna happen
-
and then you end up, as you're re-implementing
all this stuff yourself
-
you end up burning yourself out, wasting your
time
-
because you're hacking on stupid Rails 2
-
or some old Struts version
-
when you should be just taking advantage of
the new patches
-
but you can't because you're afraid to upgrade
the software
-
because you don't know what's going to happen
-
because the system is too big and too scary
-
then, and this is really bad, I think this
is something
-
Ruby messes up for all of us
-
I say this as someone who's been using Ruby
for thirteen years now
-
happily
-
we create these mountains of abstractions
-
and the logic ends up being buried inside
them
-
I mean in Java it was like static, or, you
know, factories
-
and design pattern soup
-
in Ruby it's modules and mixins and you know
-
we have all these crazy ways of hiding what's
actually happening from us
-
but when you go look at the code
-
it's completely opaque
-
you have no idea where the stuff actually
gets done
-
because it's in some magic library somewhere
-
and we do all that because we're trying to
save ourselves from the complexity of these
-
big nasty systems
-
but like if you look at the rest of the world
-
this is a software specific problem
-
these cars are old, they're older than any
software that you would ever run
-
and they're still driving down the street
-
they're older than software itself, right
-
but these things still function, they still
work
-
how? why? why do they work?
-
bodies! my body should not work
-
I have abused it
-
I should not be standing here today
-
I shouldn't have been able to come from Berlin
here
-
without dying somehow by being in the air
-
you know, by the air pressure changes
-
but our bodies somehow can survive even when
-
we don't take care of them
-
and like it's just the system that works,
right
-
so how do our bodies work?
-
how do we stay alive
-
despite this fact
-
even though we haven't done like some
-
great design, we don't have any design patterns
-
like mixed up into our bodies
-
in biology there is a term called homeostasis
-
and I literally don't know what this means
-
other than this definition
-
so you won't learn about this from me
-
there's probably at least one biologist in
the room
-
so you can correct me later
-
but basically the idea of homeostasis is
-
that an organism has all these different components
-
that serve different purposes
-
that regulate it
-
so they're all kind of in balance
-
and they work together to regulate the system
-
if one component, like a liver, does too much
-
or does the wrong thing
-
another component kicks in and fixes it
-
and so our bodies are this well designed system
-
for staying alive
-
because we have almost like autonomous agents
-
internally that take care of the many things
that can and do go wrong
-
on a regular basis
-
so you have, you know, your brain, your liver
-
your liver, of course, metabolizes toxic substances
-
your kidneys deal with blood water levels,
et cetera
-
you know all these things work in concert
to make you live
-
the inability to continue to do that is known
as homeostatic imbalance
-
so I was saying, homeostasis is balancing
-
not being able to do that is when you're out
of balance
-
and that will actually lead to really bad
health problems
-
or probably death, if you fall into homeostatic
imbalance
-
so the good news is you're already dying
-
like we're all dying all the time
-
this is the beautiful thing about death
-
there is, there is an estimate that fifty
trillion cells
-
are in your body, and three million die per
second
-
it's an estimate because it's actually impossible
to count
-
but scientists have figured out somehow that
this is probably the right number
-
so your cells, you've probably heard this
all your life
-
like physically, after some amount of time,
-
you aren't the same human being that you were,
physically
-
you know, I don't know, some period of
time ago
-
you're literally not the same organism anymore
-
but you're the same system
-
kind of interesting, isn't it
-
so in a way you can think about software this way
-
you can think about software as a system
-
if the components could be replaced like these
cells
-
like, if you focus on making death, constant
death OK
-
on a small level
-
then the system can live on a large level
-
that's what this talk is about
-
solution, the solution being to mimic living
organisms
-
and as an aside, I will say many times the
word small or tiny in this talk
-
because I think I'm learning, as I age
-
that small is good
-
it's, small projects are good
-
you know how to estimate them
-
small commitments are good
-
because you know you can make them
-
small methods are good
-
small classes are good
-
small applications are good
-
small teams are good
-
so I don't know, this is sort of a non sequitur
-
so if we're going to think about software
-
as like an organism
-
what is a cell in that context?
-
this is sort of the key question that you
have to ask yourself
-
and I say that a cell is a tiny component
-
now, tiny and component are both subjective
words
-
so you can kind of do what you want with that
-
but it's a good frame of thinking
-
if you make your software system of tiny components
-
each one can be like a cell
-
each one can die and the system is a collection
of those tiny components
-
and what you want is not for your code to
live forever
-
you don't care that each line of code lives
forever, right
-
like if you're trying to develop a legacy
in software
-
it's not important to you that your
System.out.println statement
-
lives for ten years
-
it's important to you that the function of
the system lives for ten years
-
so like, about exactly ten years ago
-
we created RubyGems at RubyConf 2003
in Austin, Texas
-
I haven't touched RubyGems myself in like
four or five years
-
but people are still using it
-
they hate it because it's software
-
everybody hates software right
-
so if you can create software that people
hate
-
you've succeeded
-
but it still exists
-
I have no idea if any of the code is the same
-
I would assume not
-
you know I think, I'm sure that my name is
still in it in a copyright notice
-
but that's about it
-
and that's a beautiful thing
-
people are still using it to install Ruby
libraries
-
and software
-
and I don't care if any of my existing, or
my initial code is still in the system
-
because the system still lives
-
so, quite a long time ago now I was researching
this kind of question
-
about Legacy software
-
and I asked a question on Twitter as I often
do at conferences
-
when I'm preparing
-
what are some of the old surviving software
systems you regularly use
-
and if you look at this, I mean, one thing
is obviously
-
everyone who answered gave some sort of Unix
related answer
-
but basically all of these things on this
list
-
are either systems that are collections of
really well-known split-up components
-
or they're tiny, tiny programs
-
so, like, grep is a tiny program, make
-
it only does one thing
-
well make is actually also arguably an operating
system
-
but I won't get into that
-
emacs is obviously an operating system, right
-
but it's well designed of these tiny little
pieces
-
so a lot of the old systems I know about follow
this pattern
-
this metaphor that I'm proposing
-
and from my own career
-
when I was here before in Bangalore
-
I worked for GE and some of the people
-
we hired even worked on the system there
-
we had a system called the Bull
-
and it was a Honeywell Bull mainframe
-
I doubt any of you have worked on that
-
but this one I know you didn't work on
-
because it had a custom operating system
-
with our own RDBMS
-
we had created a TCP stack for it
-
using like custom hardware that we plugged
into a Windows NT computer
-
with some sort of NT queuing system back in
the day
-
it was this terrifying thing
-
when I started working there the system was
already something like twenty-five years old
-
and I believe even though there have been
many, many projects
-
to try to kill it, like we had a team called
the Bull exit team
-
I believe the system is still in production
-
not as much as it used to be, there are less
and less functions in production
-
but I believe the system is still in production
-
the reason for this is that the system was
actually made up of these tiny little components
-
and like really clear interfaces between them
-
and we kept the system live because every
time we tried to replace it
-
with some fancy new gem, web thing or GUI
app
-
it wasn't as good, and the users hated it
-
it just didn't work
-
so we had to use this old, crazy, modified
mainframe
-
for a long time as a result
-
so, the question I ask myself now is
-
how do I, how do I approach a problem like
this
-
and build a system that can survive for a
long time
-
I would encourage you
-
how many of you know of Fred George
-
this is Fred George
-
he was at ThoughtWorks for awhile
-
so he may have, I think he lived in Bangalore
-
for some time with ThoughtWorks, in fact
-
he is now running a start-up in Silicon Valley
-
but he has this talk that you can watch online
-
from the Barcelona Ruby Conference the year
before last
-
called Microservice Architectures
-
and he talks in great detail about
-
how he implemented a concept at Forward
-
that's very much like what I'm talking about
-
tiny components that only do one thing and
can be thrown away
-
so Microservice Architecture is kind of the
core of what I'm gonna talk about
-
now I've put together some rules for 6WunderKinder
-
which I am going to share with you
-
6WunderKinder is the company I work for
-
where we're working on Wunderlist
-
and the rules of the, the goals of these rules
-
are to reduce coupling, to make it where we
can do fear-free deployments
-
we reduce the chance of "cruft" in our code
-
like nasty stuff that you're afraid of
-
that you leave there, kind of broken window
problems
-
we make it literally trivial to change code
-
so you just never have to ask how do I do
that
-
you just find it easy
-
and most importantly we give ourselves the
freedom to go fast
-
because I think no developer ever wants to
be slow
-
that's one of the worst things
-
just toiling away and not actually accomplishing
anything
-
but we go slow because we're constrained by
the system
-
and we're constrained by, sometimes projects
-
and other, you know, management related things
-
but often times its the mess of the system
that we've created
-
so some of the rules
-
I think one thing, and maybe, maybe I'm going
to get some push back from this crowd
-
one rule that is less controversial than it
used to be
-
is that comments are a design smell
-
does anyone strongly disagree with that?
-
no?
-
does anyone strongly agree with that?
-
OK, so the rest of you have no idea what I'm
talking about
-
so a design smell, I want to define this really
quickly
-
a design smell is something you see in your
code or your system
-
where it doesn't necessarily mean it's bad
-
but you look at it and you think
-
hmm, I should look into this a little bit
-
and ask myself, why are there so many comments
in this code?
-
you know, especially the bottom one
-
inline comments?
-
definitely bad, definitely a sign that you
should have another method, right
-
so it's pretty easy to convince people
-
that comments are a design smell
-
and I think a lot of people in the industry
-
are starting to agree
-
maybe not for like a public library
-
where you really need to tell someone
-
here's how you use this class and this is
what it's for
-
but you shouldn't have to document every method
-
and every argument because the method name
and the argument name
-
should speak for themselves, right
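-
As a rough illustration of that point (not from the talk, and with invented names), an inline comment that explains a block of code usually wants to become a well-named method instead:

    module Before
      def self.charge(order)
        # only charge orders that are paid for and not yet shipped
        return unless order[:paid] && !order[:shipped]
        puts "charging order #{order[:id]}"
      end
    end

    module After
      # the method name does the explaining, so no comment is needed
      def self.chargeable?(order)
        order[:paid] && !order[:shipped]
      end

      def self.charge(order)
        return unless chargeable?(order)
        puts "charging order #{order[:id]}"
      end
    end

    order = { id: 1, paid: true, shipped: false }
    Before.charge(order)   # both print "charging order 1"
    After.charge(order)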
-
so here's one that you probably won't agree
with
-
tests are a design smell
-
so this one is probably a little more controversial
-
especially in an environment where you're
maybe still struggling,
-
struggling with people to actually get them
to write tests to begin with, right
-
you know I went through this period in, like,
2000 and 2001
-
where I was really heavily into evangelizing
TDD
-
and it was really stressful that you couldn't
get anyone to do it
-
I think you do have to go through that period
-
and I'm not saying you shouldn't write any
tests
-
but that picture I showed you earlier of the
slow, brittle test suite
-
that's bad, right
-
that is a bad state to be in
-
and you're in that state because your tests
suck
-
that's why you get in that state
-
your tests suck because you're writing bad
tests
-
that don't exercise the right things in your
system
-
and what I've found is whenever I look into
one of these
-
big slow brittle test suites
-
the tests themselves are indications
-
and the sheer proliferation of tests
-
are indications that the system is bad
-
and the developers are like desperately
-
fearfully trying to run the code
-
in every way they can
-
because it's the only way they can manage
-
to even think about the complexity
-
but if you think about it, if you had a tiny
trivial system
-
you wouldn't need to have hundreds of test
files
-
that take ten minutes to run, ever
-
if you did, you're doing something stupid
-
you're wasting your time working on tests
-
and we as software developers obsess about
this kind of thing
-
because we have to fight so hard to get our
peers to do it in the first place
-
and to understand it
-
we obsess to the point where we focus on the
wrong thing
-
none of us are in the business of writing
tests for customers
-
like we're not launching our tests on the
web
-
and hoping people will buy them, right
-
it doesn't provide value, it's just a side-effect
-
that we have focused too heavily on
-
and we've lost sight of what the actual goal
is
-
so, this one actually requires a visual
-
I tell the people on my team now
-
you can write code in any language you want
-
any framework you want, anything you want
to do
-
as long as the code is this big
-
so if you want to write the new service in
Haskell
-
and it's this big in a normal size font
-
you can do it
-
if you want to do it in Clojure or Elixir
or Scala or Ruby
-
or whatever you want to do
-
even Python for god's sake
-
you can do it if it's this big and no bigger
-
why? because it means I can look at it
-
and I can understand it
-
or if I don't I'll just throw it away
-
because if it's this big it doesn't do very
much, right
-
so the risk is really low
-
and I really mean the system is that
-
that is, the component is that big
-
and in my world a component means a service
-
that's running and probably listening on an
HTTP port
-
or some sort of Thrift or RPC protocol
-
so it's a standalone thing
-
it's its own application
-
it's probably in its own git repository
-
people do pull requests against it
-
but it's just tiny
-
so this big
-
at the top of this, by the way
-
is some code by Konstantin Haase
-
who also lives in Berlin, where I live
-
this is a rewrite of Sinatra
-
the web framework
-
and Konstantin is actually the maintainer
of Sinatra
-
it's not fully compatible, but it's amazingly
close
-
and it all fits right in that
-
but the font size is kind of small, so I cheated
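-
For a sense of what such a tiny component might look like, here is a minimal sketch using Sinatra, since it just came up; the endpoints and the in-memory store are invented for illustration, not Wunderlist's actual API:

    # a hypothetical tiny component: one small standalone HTTP service,
    # its own repo, small enough to read in one sitting or throw away
    require 'sinatra'
    require 'json'

    REMINDERS = {} # in-memory store, just to keep the sketch self-contained

    post '/reminders' do
      id = (REMINDERS.size + 1).to_s
      REMINDERS[id] = JSON.parse(request.body.read)
      content_type :json
      status 201
      { 'id' => id }.to_json
    end

    get '/reminders/:id' do
      content_type :json
      (REMINDERS[params['id']] || halt(404)).to_json
    end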
-
another rule, our systems are heterogeneous
by default
-
so I say you can write in any language you
want
-
that's not just because I want the developers
to be excited
-
although I think, most of you, if you worked
-
in an environment where your boss told you
-
you can use any programming language or tool
you want
-
you would be pretty happy about that, right
-
anyone unhappy about that? I don't think so
-
unless it's one of the bosses here
-
that's like don't tell people that
-
so that's one thing
-
the other one is, it leads to a good system
design
-
because think about this
-
if I write one program in Erlang, one component
in Erlang
-
one program in Ruby
-
I have to work really, really hard to make
tight coupling
-
between those things
-
like I have to basically use computer science
to do that
-
I don't even know what I would do
-
you know it's hard
-
like I would have to maybe implement Ruby
in Erlang
-
so that it can run in the same VM or vice
versa
-
it's just silly, I wouldn't do it
-
so if my system is heterogeneous by default
-
my coupling is very low, at least at a certain
level by default
-
because it's the path of least resistance
-
is to make the system decoupled
-
it's easier to make things decoupled than
coupled
-
if they're all running in different languages
-
so in the past three months, I'll say
-
I have written production code in Objective-C,
Ruby, Scala, Clojure, Node
-
I don't know, more stuff, Java
-
all these different languages
-
real code for work
-
and yes, they are not tightly coupled
-
like I haven't installed JRuby so that I could
reach into the internals of my Scala code
-
because that would be a pain
-
I don't want to do that
-
another very important one is
-
server nodes are disposable
-
so, back when I was at GE, for example
-
I remember being really proud when I looked
at the up time of one of my servers
-
and it was like four hundred days or something
-
it's like, wow, this is awesome
-
I have this big server, it had all these apps
on it
-
we kept it running for four hundred days
-
the problem with that is I was afraid to ever
touch it
-
I was really happy it was alive
-
but I didn't want to do anything to it
-
I was afraid to update the operating system
-
in fact you could not upgrade Solaris then
without restarting it
-
so that meant I had not upgraded the operating
system
-
I probably shouldn't have been too proud about
it
-
Nodes that are alive for a long time lead
to fear
-
and what I want is less fear
-
so I throw them away
-
and this means I don't have physical servers
that I throw away
-
that would be fun but I'm not that rich yet
-
we use AWS right now, you could do it with
any kind of cloud service
-
or even an internal cloud provider
-
but every node is disposable
-
so, we never upgrade software on an existing
server
-
whenever you want to deploy a new version
of a service
-
you create new servers
-
and you deploy that version
-
and then you replace them in the load balancer
or somewhere
-
that's it
-
so, you never have to wonder what's on a server
-
because it was deployed through an automated
process
-
and there's no fear there
-
you know exactly what it is
-
you know exactly how to recreate it
-
because you have a golden master image
-
and in our case it's actually an Amazon image
-
that you can just boot more of
-
if scaling is a problem
-
you just boot ten more servers
-
boom, done, no problem
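-
A simplified sketch of that deploy flow, assuming the AWS CLI and purely hypothetical image, load balancer, and instance names; a real script would wait for health checks before swapping traffic:

    # a simplified immutable deployment: boot fresh servers from a new
    # image, swap them into the load balancer, throw the old ones away
    NEW_IMAGE = 'ami-0123456789abcdef0'   # baked with the new service version
    ELB_NAME  = 'wunderlist-api'          # hypothetical load balancer name
    OLD_NODES = %w[i-0aaa i-0bbb i-0ccc]  # nodes currently serving traffic

    boot = "aws ec2 run-instances --image-id #{NEW_IMAGE} --count #{OLD_NODES.size} " \
           "--query 'Instances[].InstanceId' --output text"
    new_nodes = `#{boot}`.split

    # put the new servers in...
    system("aws elb register-instances-with-load-balancer " \
           "--load-balancer-name #{ELB_NAME} --instances #{new_nodes.join(' ')}")

    # ...take the old ones out, and destroy them
    system("aws elb deregister-instances-from-load-balancer " \
           "--load-balancer-name #{ELB_NAME} --instances #{OLD_NODES.join(' ')}")
    system("aws ec2 terminate-instances --instance-ids #{OLD_NODES.join(' ')}")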
-
so yeah I tell the team, you know, pick your
technology
-
everything must be automated, that's another
piece
-
if you're going to deploy a Clojure service
for the first time
-
you have to be responsible for figuring out
how it fits into our deployment system
-
so that you have immutable deployments and
disposable nodes
-
if you can do that and you're willing to also
maintain it and teach someone else
-
about the little piece of code that you wrote,
then cool
-
you can do it, any level you want
-
and then once you deploy stuff
-
like a lot of us like to just SSH into the machines
-
and then twiddle with things and replace files
-
and like try like fixing bugs live on production
-
why not just throw away the actual keys
-
because you're going to throw away the system
eventually
-
you don't even need root access to it
-
you don't need to be able to get to it
-
except through the port that your service
is listening on
-
so you can't screw it up
-
you can't introduce entropy and mess things
up
-
if you throw away the keys
-
so this is actually a practice that you can
do
-
deploy the servers, remove all the credentials
-
for logging in and the only option you have
-
is to destroy them when you're done with them
-
provisioning new services in our world
-
must also be trivial
-
so we have actually now thrown away our Chef
repository
-
because Chef is obsolete and
-
we have replaced it with shell scripts
-
and that sounds like I'm an idiot
-
I know, but when I say Chef is obsolete
-
I don't really mean that
-
I like to say that so that people will think
-
because a lot of you are probably thinking
-
we should move to Chef
-
that would be great
-
because what you have is a bunch of servers
-
that are running for a long time
-
and you need to be able to continue to keep
them up to date
-
Chef is really great at that
-
Chef is also good at booting a new server
-
but really it's just overkill for that
-
yeah
-
so if you're always throwing stuff away
-
I don't think you need Chef
-
do something really, really simple
-
and that's what we've done
-
so like whenever we deploy a new type of service
-
I set up ZooKeeper recently, which is a complete
change from the other stuff we're deploying
-
I think it was a five line shell script to
do that
-
I just added it to a git repo and ran a command
-
I've got a cluster of ZooKeeper servers running
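-
In the same spirit, a provisioning sketch (shown in Ruby rather than shell, with invented names), assuming a pre-baked image so no configuration management is needed:

    # boot a small ZooKeeper cluster from a pre-baked image; no recipes,
    # no config management, just a golden image and one command
    IMAGE = 'ami-zookeeper-prebaked'   # hypothetical image with ZooKeeper installed
    COUNT = 3

    booted = system("aws ec2 run-instances --image-id #{IMAGE} " \
                    "--count #{COUNT} --instance-type m3.medium")
    abort('could not boot the ZooKeeper nodes') unless booted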
-
you want to always be deploying your software
-
this is something I learned from Kent Beck
early on in the agile extreme programming
-
world
-
that if something is hard
-
or you perceive it to be hard or difficult
-
the best thing you can do
-
if you have to do that thing all the time
-
is to just do it constantly
-
non-stop all the time
-
so like deploying in our old world
-
where it would take all night once a week
-
if we instituted a new policy
-
in that team that said
-
any change that goes to master must be deployed
within five minutes
-
I guarantee you we would have fixed that process,
right
-
and if you're deploying constantly
-
all day every day
-
you're never going to be afraid of deployments
-
because it's always a small change
-
so always be deploying
-
every new deploy means you're throwing away
old servers
-
and replacing them with new ones
-
in our world I would say that the average
uptime
-
of one of our servers is probably something
like
-
seventeen hours and that's because we don't
tend to work on the weekend very much
-
you also, when you have these sorts of systems
-
that are distributed like this
-
and you're trying to reduce the fear of change
-
the big thing that you're afraid of is failure
-
you're afraid that the service is going to
fail
-
the system is going to go down
-
one component won't be reachable, that sort
of thing
-
so you just have to assume that that's going
to happen
-
you are not going to build a system that never
fails, ever
-
I hope you don't, because you will have wasted
much of your life
-
trying to get that to happen
-
instead, assume that the thing, the components
are going to fail
-
and build resiliency in
-
I have a picture here of Joe Armstrong
-
who is one of the inventors of Erlang
-
if you have not studied Erlang philosophy
around failure and recovery
-
you should
-
and it won't take you long
-
so I'm just going to leave that as homework
for you
-
and then, you know, I said that tests are
a design smell
-
I don't mean don't write any tests
-
but I also want to be further responsible
here
-
and say you should monitor everything
-
you want to favor measurement over testing
-
so I use measurement as a surrogate for testing
-
or as an enhancement
-
and the reason I say this is
-
you can either focus on one of two things
-
I said assume failure right, so
-
mean time between failures or mean time to
resolution
-
those are kind of two metrics in the ops world
-
that people talk about
-
for measuring their success and their effectiveness
-
mean time between failures means
-
you're trying to increase the time between
failures
-
of the system, so basically you're trying
to make failures never happen, right
-
mean time to resolution means
-
when they happen, I'm gonna focus on bringing
them back
-
as fast as I possibly can
-
so a perfect example would be a system fails
-
and another one is already up and just takes
over its work
-
mean time to resolution is essentially zero,
right
-
if you're always assuming that every component
can and will fail
-
then mean time to resolution is going to be
really good
-
because you're going to bake it into the process
-
if you do that, you don't care about when
things fail
-
and back to this idea of favoring measurement
over testing
-
if you're monitoring everything, everything
with intelligence
-
then you're actually focusing on mean time
to resolution
-
and acknowledging that the software is going
to be broken sometimes, right
-
and when I say monitor everything, I mean
everything
-
I don't mean, like your disk space and your
memory and stuff there
-
I'm talking about business metrics
-
so, at LivingSocial we created this thing
called rearview
-
which is now opensource
-
which allows you to do aberration detection
-
and aberration means strange behavior, strange
change in behavior
-
so rearview can do aberration detection
-
on data sets, arbitrary data sets
-
which means, like in the living social world
-
we had user sign ups
-
constantly streaming in
-
it was a very high volume site
-
if user sign-ups were weird
-
we would get an alert
-
why might they be weird?
-
one thing could be like the user service is
down, right
-
so then we would get two alerts
-
user sign ups have gone down
-
and so has the service
-
so obviously the problem is the service is
down
-
let's bring it back up
-
but it could be something like
-
a front-end developer or a designer
-
made a change that was intentional
-
but it just didn't work and no one liked it
-
so they didn't sign up to the site anymore
-
that's more important than just knowing that
the service is down
-
right, because what you care about
-
isn't that the service is up or down
-
if you could crash the entire system and still
be making money
-
you don't care, right, that's better
-
throw it away and stop paying for the servers
-
but if your system is up 100% of the time
and performs excellently
-
but no one's using it, that's bad
-
so monitoring business metrics gives you a
lot more than unit tests could ever give you
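-
A toy sketch of that kind of aberration detection on a business metric; the data and threshold are invented, and a real setup would watch live metric streams rather than a hard-coded array:

    # alert when a business metric drifts too far from its recent history
    def aberrant?(history, current, sigmas: 3.0)
      mean     = history.sum.to_f / history.size
      variance = history.map { |x| (x - mean)**2 }.sum / history.size
      (current - mean).abs > sigmas * Math.sqrt(variance)
    end

    signups_per_minute = [42, 38, 45, 40, 44, 39]  # recent history
    current = 3                                    # e.g. a broken sign-up form
    puts 'ALERT: user sign-ups look weird' if aberrant?(signups_per_minute, current)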
-
and then in our world
-
we focused on experiencing
-
no, you have to come up to the front and say ten!
-
ok, ten minutes left
-
when I got to 6WunderKinder in Berlin
-
everyone was terrified to touch the system
-
because they had created a really well-designed
-
but traditional monolithic API
-
so they had layers of abstractions
-
it was all kind of in one big thing
-
they had a huge database
-
and they were really, really scared to do
anything
-
so there's like one person who would deploy
anything
-
and everyone else was trying to work on other
projects
-
and not touch it
-
but it was like the production system
-
you know so it wasn't really an option
-
so the first thing I did in my first week
-
is I got these graphs going
-
and this was, yeah, response time
-
and the first thing I did is I started turning
off servers
-
and just watching the graphs
-
and then, as I was turning off the servers
-
I went to the production database
-
and I did SELECT COUNT(*) FROM tasks
-
and we're a task management app
-
so we have hundreds of millions of tasks
-
and the whole thing crashed
-
and all the people were like AAAAH what's
going on
-
you know, and I said, it's no problem
-
I did this on purpose, I'll just make it come
back
-
which I did
-
and from that point on
-
like, really every day I would do something
-
which basically crashed the system for just
a moment
-
and really, like, we had way too many servers
in production
-
we were spending tens of thousands more Euros
per month
-
than we should have on the infrastructure
-
and I just started taking things away
-
and I would usually do it
-
instead of the responsible way,
-
like one server at a time
-
I would just remove all of them and start
adding them back
-
so for a moment everything was down
-
but after that we got to a point where
-
everyone on the team was absolutely comfortable
-
with the worst case scenario
-
of the system being completely down
-
so that we could, in a panic free way
-
just focus on bringing it up when it was bad
-
so now when you do a deployment
-
and you have your business metrics being measured
-
you know the important stuff is happening
-
and you know what to do when everything is
down
-
you've experienced the worst thing that can
happen
-
well the worst thing is like someone breaks
in
-
and steals all your stuff, steals all your
users' phone numbers
-
and posts them online like SnapChat or something
-
but you've experienced all these potentially
horrible things
-
and realized, eh, it's not so bad, I can deal
with this
-
I know what to do
-
it allows you to start making bold moves
-
and that's what we all want right
-
we all want to be able to bravely go into
our systems
-
and do anything we think is right
-
so that's what I've been focusing on
-
we also do this thing called Canary in the
Coal Mine deployments
-
which removes the fear, also
-
canary in the coalmine refers to a kind of
sad thing
-
about coal miners in the US
-
where they would send canaries into the mines
-
at various levels
-
and if the canary died they knew there was
a problem
-
with the air
-
but in the software world
-
what this means is you have bunch of servers
running
-
or a bunch of, I don't know, clients running
a certain version
-
and you start introducing the new version incrementally
-
and watching the effects
-
so once you're measuring everything
-
and monitoring everything
-
you can also start doing these canary in the
coalmine things
-
where you say OK I have a new version of this
service
-
that I'm going to deploy
-
and I've got thirty servers running for it
-
but I'm going to change only five of them
now
-
and see, like, does my error rate increase
-
or does my performance drop on those servers
-
or do people actually not successfully complete
the task they're trying to do
-
on those servers
-
so the combination of
monitoring everything
-
and these immutable deployments and everything
-
gives us the ability to gradually effect change
and not be afraid
-
so we roll out changes all day every day
-
because we don't fear that we're just going
to destroy the entire system all at once
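-
A toy sketch of the canary evaluation described above; the request counts and the rollback threshold are invented:

    # compare the error rate on the few canary servers running the new
    # version against the rest, and decide whether to continue the rollout
    def error_rate(requests, errors)
      errors.to_f / requests
    end

    baseline = error_rate(120_000, 240)   # 25 servers on the old version
    canary   = error_rate(24_000, 300)    # 5 servers on the new version

    if canary > baseline * 2
      puts format('canary failed: %.2f%% vs %.2f%% errors, roll back', canary * 100, baseline * 100)
    else
      puts 'canary looks healthy, continue the rollout'
    end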
-
so I think I have like five minutes left
-
uh, these are some things we're not necessarily
doing yet
-
but they're some ideas that I have
-
that given some free time I will work on
-
and, they're probably more exciting
-
one is I talked about homeostatic regulation
-
and homeostasis
-
so I think we all understand the idea of you
know homeostasis
-
and the fact that systems have different parts
that do different roles
-
and can protect each other from each other
-
but, so this diagram is actually just some
random diagram
-
I copied and pasted off the AWS website
-
so it's not necessarily all that meaningful
-
except to show that every architecture
-
especially server based architectures
-
has a collection of services that play different
roles
-
and it almost looks like a person
-
you've got a brain and a heart and a liver
-
and all these things, right
-
what would it mean to actually implement
-
homeostatic regulation in a web service?
-
so that you have some controlling system
-
where the database will actually kill an app
server
-
that is hurting it, for example
-
just kill it
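-
A purely speculative sketch of what such a regulator might look like; the query budget, node names, and numbers are all invented:

    # a speculative regulator: watch per-node query load and kill any app
    # server that is hurting the database
    QUERY_BUDGET_PER_MINUTE = 10_000

    # in real life this would come from the database's own statistics;
    # hard-coded here so the sketch stands on its own
    load_by_node = { 'app-01' => 4_200, 'app-02' => 61_000, 'app-03' => 3_900 }

    load_by_node.each do |node, queries_per_minute|
      next if queries_per_minute <= QUERY_BUDGET_PER_MINUTE
      puts "#{node} is hurting the database (#{queries_per_minute} qpm), killing it"
      # here the regulator would terminate the node and let the normal
      # boot-from-image process replace it with a fresh one
    end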
-
I don't know yet, I don't know what that is
-
but some ideas about this stuff
-
I don't know if you've heard of these
-
Netflix, do you have Netflix in India yet?
-
probably not, unless you have a VPN, right
-
NetFlix has a really great cloud based architecture
-
they have this thing called Chaos Monkey they've
created
-
which goes through their system and randomly
destroys Nodes
-
just crashes servers
-
and they did this because, when they were,
they were early users of AWS
-
and when they went out initially with AWS,
servers were crashing
-
like it was still immature
-
so they said OK we still want to use this
-
and we'll build in stuff so that we can deal
with the crashes
-
but we have to know it's gonna work when it
crashes
-
so let's make crashing be part of production
-
so they actually have gotten really sophisticated
now
-
and they will crash entire regions
-
cause they're in multiple data centers
-
so they'll say like, what would happen if
this
-
data center went down, does the site still
stay up?
-
and they do this in production all the time
-
like they're crashing servers right now
-
it's really neat
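-
A toy version of that idea, with invented node names; the real Chaos Monkey runs continuously against production infrastructure:

    # pick a random node and kill it, so that surviving the loss of a node
    # is exercised constantly instead of feared
    NODES = %w[api-01 api-02 api-03 worker-01 worker-02]

    victim = NODES.sample
    puts "chaos monkey strikes: terminating #{victim}"
    # e.g. system("aws ec2 terminate-instances --instance-ids #{victim}")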
-
another one that is inspirational in this
way
-
is Pinterest, they use AWS as well
-
and they have, AWS has this thing called Spot
Instances
-
and I won't go into too much detail
-
because I don't have time
-
but Spot Instances allow you to effectively
-
bid on servers at a price that you are willing
to pay
-
so like if a usual server costs $0.20 per
minute
-
you can say, I'll give $0.15 per minute
-
and when excess capacity comes open
-
it's almost like a stock market
-
if $0.15 is the going price, you'll get a
server
-
and it starts up and it runs what you want
-
but here's the cool thing
-
if the stock market goes and the price goes
higher than you're willing to pay
-
Amazon will just turn off those servers
-
they're just dead, you don't have any warning
-
they're just dead
-
so Pinterest uses this for their production
servers
-
which means they save a lot of money
-
they're paying way under the average Amazon
cost for hosting
-
but the really cool thing in my opinion
-
is not the money they save but the fact that
-
like, what would you have to do to build a
full system
-
where any node can and will die at any moment
-
and it's not even under your control
-
that's really exciting
-
so a simple thing you can do for homeostasis
though
-
is you can just adjust
-
so in our world we have multiple nodes
-
and all these little services
-
we can scale each one independently
-
we're measuring everything
-
so Amazon has a thing called Auto Scaling
-
we don't use it, we do our own scaling
-
and we just do it based on volume and performance
-
now when you have a bunch of services like
this
-
like, I don't know, maybe we have fifty different
services now
-
that each play tiny little roles
-
it becomes difficult to figure out, like,
where things are
-
so we've started implementing ZooKeeper for
service resolution
-
which means a service can come online and
say
-
I'm the reminder service version 2.3
-
and then tell a central guardian
-
and ZooKeeper can then route traffic to
it
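-
A toy, in-process stand-in for that idea; the real registry would live in ZooKeeper, and the service names here are invented:

    # a service announces its name, version and address when it comes
    # online; the router looks services up by name instead of hard-coding hosts
    class ServiceRegistry
      def initialize
        @services = Hash.new { |hash, name| hash[name] = [] }
      end

      def announce(name, version:, address:)
        @services[name] << { version: version, address: address }
      end

      def lookup(name)
        @services[name].last   # most recently announced instance
      end
    end

    registry = ServiceRegistry.new
    registry.announce('reminders', version: '2.3', address: 'http://10.0.1.17:8080')
    puts registry.lookup('reminders')   # prints the registered entry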
-
probably too detailed for now
-
I'm gonna skip over some stuff real quick
-
but I want to talk about this one
-
if, did the Nordic Ruby, no, Nordic Ruby talks
never go online
-
so you can never see this talk
-
sorry
-
at Nordic Ruby Reginald Braithwaite did a
really cool talk
-
on like challenges of the Ruby language
-
and he made this statement
-
Ruby has beautiful but static coupling
-
which was really strange
-
but basically he was making the same point
that
-
I was talking about earlier
-
that, like Ruby creates a bunch of ways that
you can couple
-
your system together
-
that kind of screw you in the end
-
but they're really beautiful to use
-
but, like, Ruby can really lead to some deep
crazy coupling
-
and so he presented this idea of bind by contract
-
and bind by contract, in a Ruby sense
-
would be, like, I have a class that has a
method
-
that takes these parameters under these conditions
-
and I can kind of put it into my VM
-
and whenever someone needs to have a functionality
like that
-
it will be automatically bound together
-
by the fact that it can do that thing
-
and instead of how we tend to use Ruby and
Java and other languages
-
I have a class with a method name I'm going
to call it
-
right, that's coupling
-
but he proposed this idea of this decoupled
system
-
where you just say I need a functionality
like this
-
that works under the conditions that I have
present
-
so this led me to this idea
-
and this may be like way too weird, I don't
know
-
what if in your web application your routes
file
-
for your services read like a functional pattern
matching syntax
-
so like if you've ever used Erlang or Haskell
or Scala
-
any of these things that have functional pattern
matching
-
what if you could then route to different
services
-
across a bunch of different services
-
based on contract
-
now I have zero time left
-
but I'm just gonna keep talking, cause I'm
mean
-
oh wait I'm not allowed to be mean
-
because of the code of conduct
-
so I'll wrap up
-
so this is an idea that I've started working
on as well
-
where I would actually write an Erlang service
-
with this sort of functional pattern matching
-
but have it be routing in really fast real
time
-
through back end services that support it
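-
A speculative sketch of that contract-style routing, using Ruby 3's pattern matching in place of the Erlang-style matching described here; the request shapes and service names are invented:

    # route a request to a backend service based on the shape of the
    # request rather than a hard-coded class and method name
    def route(request)
      case request
      in { method: 'POST', path: '/tasks', body: { title: String } }
        'task-service'
      in { method: 'GET', path: %r{\A/reminders/\d+\z} }
        'reminder-service'
      in { method: 'GET' }
        'read-cache'
      else
        'no-matching-contract'
      end
    end

    puts route(method: 'POST', path: '/tasks', body: { title: 'buy milk' })
    puts route(method: 'GET',  path: '/reminders/42')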
-
one more thing I just want to show you real
quick
-
that I am working on and I want to show you
-
because I want you to help me
-
has anyone used JSON schema?
-
OK, you people are my friends for the rest
of the conference
-
in a system where you have all these things
talking to each other
-
you do need a way to validate the inputs and
outputs
-
but I don't want to generate code that parses
and creates JSON
-
I don't want to do something in real time
that intercepts my
-
kind of traffic, so there's this thing called
JSON schema
-
that allows you to, in a completely decoupled
way
-
specify JSON documents and how they should
interact
-
and I am working on a new thing that's called
Klagen
-
which is the German word for complain
-
it's written in Scala, so if anyone wants
to pair up on some Scala stuff
-
what it will be is a high performance asynchronous
JSON schema validation middleware
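-
For anyone who hasn't seen JSON Schema, a made-up schema for a task document plus a toy check; real validation would use a JSON Schema library (or eventually Klagen as middleware), this hand-rolled check just keeps the sketch self-contained:

    require 'json'

    # a (made-up) schema for a task document
    TASK_SCHEMA = {
      'type'       => 'object',
      'required'   => %w[title completed],
      'properties' => {
        'title'     => { 'type' => 'string' },
        'completed' => { 'type' => 'boolean' }
      }
    }

    # toy check: only verifies required keys, enough to show the idea
    def roughly_valid?(schema, document)
      schema['required'].all? { |key| document.key?(key) }
    end

    task = JSON.parse('{"title": "write slides", "completed": false}')
    puts roughly_valid?(TASK_SCHEMA, task)   # prints true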
-
so if that's interesting to anyone, even if
you don't know Scala or JSON schema
-
please let me know
-
and I believe I'm out of time so I'm just
gonna end there
-
am I right? I'm right, yes
-
so thank you very much, and let's talk during
the conference