Ruby Conf 2013 - How To Roll Your Own Ops Framework In Ruby (If You Really Have To)


0:16 - 0:18

SANDY VANDERBLEEK: OK. So I'm gonna talk
0:19 - 0:23

about how to write your own operations framework,
0:23 - 0:26

if you really have to. And that's the first
question.
0:26 - 0:30

Do you really have to? No. You don't. Other
0:30 - 0:34

people have written lots of different things,
and if
0:34 - 0:37

you're working in Ruby, Chef is pretty great
and
0:37 - 0:41

it'll get you really far. So, and you know,
0:41 - 0:43

if you have to roll your own thing, it
0:43 - 0:47

doesn't mean you can't keep using Chef. We're
using
0:47 - 0:53

Chef, and the problems we're trying to solve
are-
0:53 - 0:55

We want something that is very testable. That
is
0:55 - 1:00

our first key desirable. And we need to be
1:00 - 1:03

able to test things rapidly at the unit level
1:03 - 1:07

and then integrate it, and then the biggest
area
1:07 - 1:09

of testing is acceptance testing.
1:09 - 1:13

We want to have our framework bring up real
1:13 - 1:16

machines and make assertions about the state
of those
1:16 - 1:20

machines. And we want to know that our framework
1:20 - 1:22

can bring up the machines - you know, we
1:22 - 1:26

want continuous test running so we know we
can
1:26 - 1:28

deploy our infrastructure.
1:28 - 1:30

So we know we can build our infrastructure
at
1:30 - 1:32

all times, so at, you know, three in the
1:32 - 1:35

morning when things are going bad, we don't
also
1:35 - 1:40

have to deal with debugging our deployment
code. So
1:40 - 1:42

that is the main desirable.
1:42 - 1:47

Right, before we decided to write our own
framework
1:47 - 1:51

at the company I'm working at, we've been
using
1:51 - 1:56

CloudWatch on AWS. Not CloudWatch. CloudFormation
templates. And
1:56 - 2:00

stuff like that. They're slow and it was hard,
2:00 - 2:03

they're hard to test. So just, designing this
framework
2:03 - 2:08

with tests first is a big, big thing.
2:08 - 2:11

So, and then the, the rest of the standard
2:11 - 2:13

stuff. We need to be available. Everything
for us
2:13 - 2:16

is based around high availability. Everything
is pretty much
2:16 - 2:20

a cluster with a load balancer. So we have
2:20 - 2:21

no single point of failure.
2:21 - 2:23

And this has to apply to our, what's running
2:23 - 2:27

our, you know, operations framework, too.
The, the servers
2:27 - 2:30

that our ops is running on need to be
2:30 - 2:33

high availa- highly available, too. So it
needs to
2:33 - 2:37

bootstrap itself from a local machine into
a cluster
2:37 - 2:40

mode, and run like that.
2:40 - 2:45

Debugability is big. When a deployment fails
we need
2:45 - 2:47

to get on our machines, see why it fails,
2:47 - 2:50

see what's going on. The whole system is based
2:50 - 2:52

on the system of swapping out an already running
2:52 - 2:55

cluster with the new, newly deployed cluster.
So, and
2:55 - 2:58

we, we want to keep that old cluster around,
2:58 - 3:00

if you can, I mean that'd be nice, in
3:00 - 3:02

case the new one has some problems - just
3:02 - 3:04

swap back.
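
The swap-and-keep idea described above can be sketched in a few lines of Ruby. This is only an illustration: `LoadBalancer`, `swap_to`, and `swap_back` are hypothetical names, not the framework's actual API, and the clusters are plain strings.

```ruby
# Hypothetical sketch: a load balancer that points at one active cluster
# and remembers the previous one so a bad deploy can be rolled back.
class LoadBalancer
  attr_reader :active, :previous

  def initialize(active)
    @active = active
    @previous = nil
  end

  # Point traffic at the newly deployed cluster, keeping the old one around.
  def swap_to(new_cluster)
    @previous = @active
    @active = new_cluster
  end

  # Swap back to the old cluster if the new one has problems.
  def swap_back
    raise "no previous cluster to swap back to" if @previous.nil?
    @active, @previous = @previous, @active
  end
end

lb = LoadBalancer.new("cluster-v1")
lb.swap_to("cluster-v2")
lb.swap_back
puts lb.active
```

Keeping the old cluster as `previous` rather than destroying it is what makes the rollback a cheap pointer swap.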
3:04 - 3:07

So, also audit-able is a big thing. We want
3:07 - 3:13

to know what's going on every transition of
our
3:13 - 3:16

operations resource, we want to know why it
triggered
3:16 - 3:18

and it wasn't successful. You know, how long
has
3:18 - 3:20

it taken, we want to know average deploy times,
3:20 - 3:25

average fail times, et cetera. Want to be
able
3:25 - 3:27

to run those analytics.
3:27 - 3:31

So what's wrong with just Chef? So Chef server
3:31 - 3:32

is what you would go to to kind of
3:32 - 3:35

provide this whole, you know, framework for
managing your
3:35 - 3:39

settings, managing your machines, doing service
discovery, stuff like
3:39 - 3:46

that. And to make that highly available is
non-trivial.
3:46 - 3:49

And you know Chef-spec, the testing, you know
that,
3:49 - 3:50

one of the testing libraries that is out there
3:50 - 3:54

for it, it's just pretty much a unit test.
3:54 - 3:57

It just doesn't actually do anything. It just
tests
3:57 - 4:00

your code, and you know, can it run.
4:00 - 4:02

And there are some other tools out there,
like
4:02 - 4:09

Cucumber-Chef, which comes from this book,
Test-Driven Infrastructure
4:09 - 4:13

using Chef, and you get to write Cucumber
tests
4:13 - 4:17

that say, like, real machines, it deploys
actual AWS
4:17 - 4:21

resources and then runs your tests on those
servers.
4:21 - 4:25

So that's a pretty cool tool. It's, it's not
4:25 - 4:29

really under active development right now,
and it's not
4:29 - 4:33

quite flexible for what we wanted to do. And
4:33 - 4:35

something really cool that's coming out with
the same
4:35 - 4:37

people who did Chef-Spec is Test Kitchen.
4:37 - 4:39

I don't know if anyone's seen this, but it,
4:39 - 4:42

it definitely is what we're aiming to do from
4:42 - 4:47

the testing standpoint. So, but it's under
wraps and
4:47 - 4:48

all that.
4:48 - 4:51

So here are the components that I came up
4:51 - 4:54

with for this framework. It's all API based.
So
4:54 - 4:58

we have ops, basically, as a API service where,
4:58 - 5:02

and, just, developers can make API calls from
a,
5:02 - 5:06

from Perl or whatever. Or there's also a front-end
5:06 - 5:09

component that I built in JavaScript.
5:09 - 5:14

So the API is like, you know, access, it's
5:14 - 5:17

just your, your single point of control. Then
all
5:17 - 5:20

the, you know, our business logic is in the
5:20 - 5:24

domain. The domain layer. And that's things
like, what
5:24 - 5:28

we consider part of our deployment process.
We built
5:28 - 5:33

images, we deploy clusters, we have some settings
and
5:33 - 5:37

users with permissions. So that's basically
our domain.
5:37 - 5:39

And then the ops is like the whole meat
5:39 - 5:42

of it. This is all the nasty stuff, working
5:42 - 5:46

with your cloud library, working with, you
know, Unix,
5:46 - 5:48

getting all of the things done that you need
5:48 - 5:51

to get done to make your domain a reality.
5:51 - 5:56

We have a database for persistence. We're
using MongoDB
5:56 - 6:01

in a cluster. So the database isn't that important.
6:01 - 6:04

Whatever you're gonna use just needs to be
reliable.
6:04 - 6:07

And then the frontend - it can be a
6:07 - 6:12

app, a command line, whatever. That's why
you make
6:12 - 6:15

the API, so you have that flexibility at the
6:15 - 6:16

front-end level.
6:16 - 6:20

So the API - we have end points. It's
6:20 - 6:24

just rack, so it's pretty simple. We're using
Grape.
6:24 - 6:29

Grape is really nice for writing quick APIs
in
6:29 - 6:34

Ruby. Grape has entities, which map domain
objects
6:34 - 6:38

to the JSON representation. We're just using
JSON in and
6:38 - 6:40

JSON out.
6:40 - 6:43

So and then when we consume a representation
that
6:43 - 6:46

a client has messed with, it's called a representation
6:46 - 6:49

right now which is not a great name. But
6:49 - 6:52

that's to take a representation and go back
into
6:52 - 6:58

the domain layer from there. Then services
kind of
6:58 - 6:59

act as the interface to the domain and ops
6:59 - 7:03

layer for the API, so it's not highly coupled
7:03 - 7:05

to what's going on there.
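
To make the entity and representation idea concrete, here is a minimal sketch in plain Ruby. It deliberately does not use Grape's own entity API; `Cluster`, `ClusterEntity`, `represent`, and `consume` are hypothetical names chosen for illustration.

```ruby
require "json"

# Hypothetical domain object; internal_token stands for a field that
# should never leave the domain layer.
Cluster = Struct.new(:id, :name, :state, :internal_token)

class ClusterEntity
  EXPOSED = [:id, :name, :state].freeze

  # Map a domain object to the hash that becomes the JSON sent to clients.
  def self.represent(cluster)
    EXPOSED.each_with_object({}) { |field, hash| hash[field] = cluster[field] }
  end

  # Take JSON a client has modified and go back toward the domain layer,
  # ignoring any keys that are not exposed.
  def self.consume(json)
    data = JSON.parse(json, symbolize_names: true)
    Cluster.new(*EXPOSED.map { |field| data[field] })
  end
end

cluster = Cluster.new(1, "web", "up", "secret")
json = JSON.generate(ClusterEntity.represent(cluster))
puts json
puts ClusterEntity.consume(json).name
```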
7:05 - 7:09

And the clients, the API provides some clients,
just
7:09 - 7:13

REST clients basically. Everything is RESTful.
So there are,
7:13 - 7:15

there are a couple of client objects in the
7:15 - 7:17

API layer that you could use for a command
7:17 - 7:20

line inside the ops.
7:20 - 7:25

Wherever you need it. And execution is an
abstraction
7:25 - 7:28

for the API to say go do something, I'm
7:28 - 7:31

gonna respond to the client, and some work
is
7:31 - 7:32

gonna go on in the background.
7:32 - 7:35

And there's a, I extracted it because there's
a
7:35 - 7:38

couple different ways, just for getting started.
Just want
7:38 - 7:42

to fork, you know, but we use AWS flows
7:42 - 7:44

a lot, to do a lot of our work,
7:44 - 7:46

so. I wanted to make it flexible enough so
7:46 - 7:49

that when we plug it into our flow infrastructure,
7:49 - 7:54

we can run our tasks like that.
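
A rough sketch of what such an execution abstraction could look like. The talk mentions forking and AWS flows as back ends; this example swaps in a plain Ruby `Thread` so it runs anywhere, and `ThreadExecution`, `start`, and `wait` are hypothetical names.

```ruby
require "securerandom"

# Hypothetical execution abstraction: the API hands over a block of work,
# gets a run id back right away, and the work continues in the background.
class ThreadExecution
  def initialize
    @runs = {}
  end

  # Start the work and return immediately with an id for the run.
  def start(&work)
    run_id = SecureRandom.uuid
    @runs[run_id] = Thread.new(&work)
    run_id
  end

  # Block until the run finishes and return the block's result.
  def wait(run_id)
    @runs.fetch(run_id).value
  end
end

execution = ThreadExecution.new
run_id = execution.start { :built }
puts execution.wait(run_id)
```

Because callers only see `start` and a run id, a different back end could be dropped in behind the same two methods.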
7:54 - 7:56

So at the domain layer we have resources which
7:56 - 8:00

have states and logic, and then there are
provisioners,
8:00 - 8:04

which are state machines over these resources.
So states,
8:04 - 8:08

for example, image has a pending state, a
building
8:08 - 8:12

state, a built state, a destroyed state, et
cetera.
8:12 - 8:14

And the provisioner is the state machine that's
gonna
8:14 - 8:18

run through all those states which transitions
using success
8:18 - 8:22

and failure, and all this happens in the background,
8:22 - 8:25

usually when you ask for a resource to change
8:25 - 8:27

state. Provisioner is gonna determine what
it needs to
8:27 - 8:29

do to change that state, and then go about
8:29 - 8:30

doing that.
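
The resource and provisioner relationship can be sketched like this. It is a simplified illustration, not the framework's code: `Image`, `ImageProvisioner`, and the `:failed` state are assumptions, and only the pending and building transitions are modelled.

```ruby
# Hypothetical resource: it only knows its current state.
class Image
  attr_accessor :state

  def initialize
    @state = :pending
  end
end

# Hypothetical provisioner: a state machine over the resource that is
# driven by just two events, :success and :failure.
class ImageProvisioner
  TRANSITIONS = {
    pending:  { success: :building, failure: :failed },
    building: { success: :built,    failure: :failed }
  }.freeze

  def initialize(resource)
    @resource = resource
  end

  # Apply one event and return the new state.
  def fire(event)
    next_state = TRANSITIONS.fetch(@resource.state, {})[event]
    unless next_state
      raise ArgumentError, "no #{event} transition from #{@resource.state}"
    end
    @resource.state = next_state
  end
end

image = Image.new
provisioner = ImageProvisioner.new(image)
provisioner.fire(:success)
provisioner.fire(:success)
puts image.state
```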
8:30 - 8:35

And that uses the ops layer, where the providers
8:35 - 8:38

are basically controlled by the provisioners
in the machines.
8:38 - 8:41

They communicate using just success and failure
and pass
8:41 - 8:44

an options hash. And then the ops has a
8:44 - 8:47

lot of, you know, tools to use the cloud
8:47 - 8:49

services, so we can get things done on the
8:49 - 8:50

cloud.
8:50 - 8:52

And very important are the testing tools to
prove
8:52 - 8:54

that the things actually got done on the cloud.
8:54 - 8:57

We want to know processes are running, files,
directories,
8:57 - 9:02

everything is set up. Everything is good.
OK.
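
As an illustration of the kind of checks such testing tools make, here is a small sketch using only Ruby's standard library. `MachineChecks` and its method names are hypothetical, and the checks run against the local machine rather than a remote instance.

```ruby
require "tmpdir"

# Hypothetical checks that would run on a deployed machine.
module MachineChecks
  def self.directory?(path)
    File.directory?(path)
  end

  def self.file_contains?(path, text)
    File.file?(path) && File.read(path).include?(text)
  end

  # Signal 0 checks that a process exists without actually signalling it.
  def self.process_running?(pid)
    Process.kill(0, pid)
    true
  rescue Errno::ESRCH
    false
  rescue Errno::EPERM
    true # the process exists but belongs to another user
  end
end

Dir.mktmpdir do |dir|
  config = File.join(dir, "app.conf")
  File.write(config, "port=8080\n")
  puts MachineChecks.directory?(dir)
  puts MachineChecks.file_contains?(config, "port=8080")
  puts MachineChecks.process_running?(Process.pid)
end
```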
9:02 - 9:04

And the database. It's just a database. We
use
9:04 - 9:07

it to store data. There are mappers that map
9:07 - 9:11

the resources to MongoDB and back. It uses
the
9:11 - 9:15

data mapper pattern. Perpetuity is a cool
gem. Right
9:15 - 9:17

now it just works with Mongo, but they're
adding
9:17 - 9:20

PostgreSQL to it.
9:20 - 9:23

So I also looked into ROM.rb, which is
9:23 - 9:28

pretty cool, but definitely not ready for
use. Tried
9:28 - 9:34

to keep the, you know, model persistence out
of
9:34 - 9:39

the domain layer, you know, not ActiveRecord
style. Data
9:39 - 9:42

mapper is a, is the pattern. It's in Patterns
9:42 - 9:46

of Enterprise Architecture. If you haven't
heard of it,
9:46 - 9:48

definitely check it out.
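
For anyone who hasn't seen the data mapper pattern, a minimal sketch follows. It is not Perpetuity's API: `Setting` and `SettingMapper` are hypothetical, and a plain Hash stands in for the MongoDB collection.

```ruby
# Hypothetical domain object: it knows nothing about persistence.
Setting = Struct.new(:key, :value)

# Hypothetical mapper: converts domain objects to and from stored documents.
class SettingMapper
  def initialize(store = {})
    @store = store # stands in for a MongoDB collection
  end

  # Serialize the domain object into a document and store it.
  def insert(setting)
    @store[setting.key] = { "key" => setting.key, "value" => setting.value }
    setting
  end

  # Load a document and rebuild the domain object, or return nil.
  def find(key)
    doc = @store[key]
    doc && Setting.new(doc["key"], doc["value"])
  end
end

mapper = SettingMapper.new
mapper.insert(Setting.new("region", "us-east-1"))
puts mapper.find("region").value
```

The contrast with the ActiveRecord style is that `Setting` has no `save` method; only the mapper touches storage.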
9:48 - 9:53

So the resources transition between states.
Transitions are also
9:53 - 9:56

resources. This is part of the auditability.
You want
9:56 - 10:01

to know, you know, every transition. So our
resources
10:01 - 10:05

are images, clusters, settings, users, permissions,
right now. It's
10:05 - 10:08

pretty simple.
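
Treating transitions as resources is what makes the analytics mentioned earlier possible. A small sketch, assuming each transition records how long it took; `Transition`, `TransitionLog`, and `average_seconds` are hypothetical names.

```ruby
# Hypothetical transition record, stored like any other resource.
Transition = Struct.new(:resource_id, :from, :to, :event, :seconds)

class TransitionLog
  def initialize
    @transitions = []
  end

  def record(transition)
    @transitions << transition
    self
  end

  # Average duration of transitions that ended with the given event,
  # or nil when there are none.
  def average_seconds(event)
    matching = @transitions.select { |t| t.event == event }
    return nil if matching.empty?
    matching.sum(&:seconds) / matching.size.to_f
  end
end

log = TransitionLog.new
log.record(Transition.new("image-1", :pending, :building, :success, 10))
log.record(Transition.new("image-1", :building, :built, :success, 20))
log.record(Transition.new("image-2", :pending, :failed, :failure, 4))
puts log.average_seconds(:success)
puts log.average_seconds(:failure)
```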
10:08 - 10:14

The providers are the implementations of each
resource state.
10:14 - 10:16

So this is in the ops layer. And you
10:16 - 10:18

write a provider, you'll write, like a method
called
10:18 - 10:22

build, if you're an image provider, and then
you'll have
10:22 - 10:27

like a method pending, build pending, and
just, that
10:27 - 10:30

method is called when that resource is gonna
go
10:30 - 10:33

into that state. So you need to do everything
10:33 - 10:35

that will make that resource in the state,
and
10:35 - 10:38

then say success inside the provider if you,
you
10:38 - 10:40

know, if you achieved it.
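
A provider along these lines might look roughly like the sketch below. The method names, the `Result` struct, and the fake instance id are all assumptions for illustration; the real providers would call the cloud library where the comments indicate.

```ruby
# Hypothetical provider: one method per state. Each method does the work
# for that state and reports :success or :failure plus an options hash.
class ImageProvider
  Result = Struct.new(:event, :options)

  def pending(options = {})
    # Real code would set up the base image on the cloud here.
    success(options.merge(instance_id: "i-example"))
  end

  def building(options = {})
    unless options[:instance_id]
      return failure(options.merge(error: "no instance to build on"))
    end
    # Real code would install Ruby and everything else the image needs.
    success(options.merge(installed: ["ruby"]))
  end

  private

  def success(options)
    Result.new(:success, options)
  end

  def failure(options)
    Result.new(:failure, options)
  end
end

provider = ImageProvider.new
result = provider.pending
result = provider.building(result.options)
puts result.event
```

The options hash is the only thing passed from one state to the next, which keeps the provider unaware of the provisioner driving it.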
10:40 - 10:43

And then the provisioner will actually update
the client
10:43 - 10:46

and let the resource know that it is in
10:46 - 10:51

that state at the API level. So the provisioner
10:51 - 10:54

is just the control object. It knows about
the
10:54 - 10:57

client and the only transition events are
success and
10:57 - 10:58

failure.
10:58 - 11:03

So it runs inside an execution with the run
11:03 - 11:06

ID as the transition. So a nice feature that
11:06 - 11:08

we don't have yet will be to take that
11:08 - 11:14

run ID and cancel, cancel transitions.
11:14 - 11:17

So an example flow, for images for us is,
is,
11:17 - 11:20

we start in pending, we set up our, our
11:20 - 11:24

image on AWS. We go into a build_pending and
11:24 - 11:28

we, you know, run our, we install Ruby on
11:28 - 11:29

it. We do everything we want to have the
11:29 - 11:34

image set up. It takes a while. And, well that's
11:34 - 11:36

actually the building state, sorry.
11:36 - 11:38

And then when it's built, we, you know, make
11:38 - 11:42

sure it's registered properly and everything
like that. So
11:42 - 11:45

this framework lets us think in terms of state
11:45 - 11:47

machines, which I think is really valuable.
Think about
11:47 - 11:51

state transitions and think about all your
operations resources
11:51 - 11:53

- all your operations, you know, things as
resources
11:53 - 11:55

that have states that are gonna go through
state
11:55 - 12:00

provisions as state transitions as you, you
know, deploy
12:00 - 12:03

things, make things happen.
12:03 - 12:05

So how do I make it, you know, more
12:05 - 12:08

of a framework, something reusable for everybody?
12:08 - 12:11

So right now it's, it's kind of hard. You
12:11 - 12:13

build your own subclass of resource, provisioner, or provider
or provider
12:13 - 12:17

for, you know, your domain object, something
you want,
12:17 - 12:19

you know, to act as an operations resource
for
12:19 - 12:22

you. And then you also have to do your
12:22 - 12:25

entity and endpoint service. And write the
mapper for
12:25 - 12:26

the database.
12:26 - 12:30

And so it's really like seven, at least seven
12:30 - 12:34

classes you're gonna create to make an operations
resource.
12:34 - 12:36

So that's pretty hard.
12:36 - 12:40

I've looked into trying to make a DSL to
12:40 - 12:44

build the resource and provisioner, because
they're very related.
12:44 - 12:48

It's basically the states and the state machine.
But
12:48 - 12:50

the state machine is already - I'm using
12:50 - 12:54

Workflow, it's already a state machine DSL.
So, it's
12:54 - 12:58

hard to, to make frameworks on top of frameworks
12:58 - 13:02

sometimes. And lots of fast-level native programming.
13:02 - 13:03

So.
13:03 - 13:06

But it's interesting, and definitely the
roadmap is
13:06 - 13:10

to do that. There's already a DSL for the
13:10 - 13:13

API. It's Grape. And the frontend - I made
13:13 - 13:16

some interesting decisions cause I'm a former
front end
13:16 - 13:20

developer. So I'm using Ember.js to just work
with
13:20 - 13:24

JSON. There's no frontend server. It's a static
JavaScript
13:24 - 13:29

app, self-contained. So would people be interested
in that?
13:29 - 13:33

It's kind of crazy. Also Emblem is a templating
13:33 - 13:37

language, and it's like Haml with Handlebars.
It's kind
13:37 - 13:40

of cool. Some cool stuff. So the goal is
13:40 - 13:44

definitely to open source, you know, the work
we've done,
13:44 - 13:48

and of course, profit. So what are our key
13:48 - 13:50

process benefits from this?
13:50 - 13:52

So we write acceptance tests using RSpec
matchers. They
13:52 - 13:55

run on the instances created by the API, that
13:55 - 13:59

is really big. Cause we, we could have had
13:59 - 14:02

a broken deployment for weeks, and had no
idea
14:02 - 14:06

previously. You know, cause we weren't constantly
testing our
14:06 - 14:09

deployment infrastructure.
14:09 - 14:11

So when a deployment fails, we have SSH to
14:11 - 14:13

access the machine. We have a one stop shop
14:13 - 14:18

for settings and service discovery. Failover
is a
14:18 - 14:22

fundamental construct. Swapping clusters back
and forth.
14:22 - 14:25

And it is self-documenting, which is pretty
cool, using
14:25 - 14:29

Grape. You write a couple descriptions of
your end
14:29 - 14:32

points, of your, you know, gets and puts,
and
14:32 - 14:36

then I made a, an endpoint to represent the
14:36 - 14:38

endpoints, actually. So there's an entity
for the endpoint.
14:38 - 14:41

So the actual API endpoints can be output
as
14:41 - 14:44

JSON representations, and then you can ask
for documentation
14:44 - 14:46

as JSON.
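
The "endpoint that represents the endpoints" idea can be sketched in plain Ruby. This does not use Grape's own description API; `EndpointRegistry`, `describe`, and `documentation` are hypothetical names, and the paths are made up for the example.

```ruby
require "json"

# Hypothetical registry: each endpoint is described once, and the
# documentation is just another JSON representation of those entries.
class EndpointRegistry
  Endpoint = Struct.new(:verb, :path, :description)

  def initialize
    @endpoints = []
  end

  def describe(verb, path, description)
    @endpoints << Endpoint.new(verb, path, description)
    self
  end

  # Render every described endpoint as a JSON array.
  def documentation
    JSON.generate(@endpoints.map(&:to_h))
  end
end

registry = EndpointRegistry.new
registry.describe("GET", "/clusters", "List clusters")
registry.describe("PUT", "/clusters/:id", "Change a cluster's state")
puts registry.documentation
```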
14:46 - 14:50

So that was pretty cool. But, yeah, it's not
14:50 - 14:53

done yet. Lots of work. And right now it's
14:53 - 14:55

kind of monolithic, which is a little bit
a
14:55 - 14:58

problem because we need to integrate lots
of different
14:58 - 15:02

tools that ops developers are building, and
they don't
15:02 - 15:05

all fit or, you know, they already work, and
15:05 - 15:08

how do we fit them into this provisioner or
15:08 - 15:10

provider model. It's a little heavy weight
for some
15:10 - 15:12

lightweight tools.
15:12 - 15:16

So I'm gonna show a couple examples of the
15:16 - 15:18

frontend.
15:18 - 15:23

So Bootstrap 3. It's very nice and clean.
We
15:23 - 15:27

have the resource states on the clusters: pending,
down,
15:27 - 15:30

up, some actions and we have a little menu
15:30 - 15:34

to go through our deployments. Some of these,
these
15:34 - 15:37

are just mocks, basically. The whole thing's
not working.
15:37 - 15:41

We want monitors eventually. We get paged
a lot
15:41 - 15:44

and sometimes we don't know if, well if we
15:44 - 15:46

got paged or if it was just transient.
15:46 - 15:47

So we want a page to look at really
15:47 - 15:51

quick, you know, just to see if, basically
a
15:51 - 15:53

sanity check to see if, do we really need
15:53 - 15:54

to, you know, get up at three in the
15:54 - 15:57

morning, get on the computer and get on these
15:57 - 15:59

servers and see what's up.
15:59 - 16:02

So, oh yeah. And the nice thing about using
16:02 - 16:04

this with Bootstrap is it's definitely gonna
work on
16:04 - 16:09

a mobile phone, so you'll be able to look at
16:09 - 16:14

it on your phone in bed. Yeah.
16:14 - 16:16

And this will change our life. This is a
16:16 - 16:18

big pain point for us is how we manage
16:18 - 16:21

our settings. We run Chef solo, right now
we
16:21 - 16:24

don't use Chef server, and our settings
are in
16:24 - 16:27

a bunch of S3 buckets. We have rigged tools
16:27 - 16:29

you know to update all our buckets, but it's
16:29 - 16:31

definitely not the easiest to visualize. So.
16:31 - 16:35

It's gonna help us a lot.
16:35 - 16:38

So here's some of the, some of my inspiration
16:38 - 16:42

while doing this. Test-Driven Infrastructure
with Chef. It's a
16:42 - 16:45

really quick read, it's like 70 pages. If
you're
16:45 - 16:49

interested in, you know, testing, your deployment
process, check
16:49 - 16:52

that out. It's kind of hand wave-y, but, there
16:52 - 16:55

is the code, there is the Chef, Cucumber Chef.
16:55 - 16:57

That the guy who wrote the book wrote, so
16:57 - 16:59

you can check that out too.
16:59 - 17:03

DevOps Weekly is a great, great newsletter.
I pretty
17:03 - 17:05

much read every week. They bring up some really
17:05 - 17:09

cool tools and things people are working on.
Just
17:09 - 17:12

Release It! is a very cool book in the
17:12 - 17:17

Pragmatic Programmers series or whoever releases
that. It's, it's
17:17 - 17:20

Java based, but it's, it's all about, you
know,
17:20 - 17:24

handling failure, and how important failure
is a concept
17:24 - 17:25

to operations.
17:25 - 17:28

And, of course, you know, when you're building
a
17:28 - 17:31

framework and you're really trying to find
these, this
17:31 - 17:36

structure, Patterns of Enterprise Architecture
is a classic, and
17:36 - 17:39

Growing Object-Oriented Software, Guided by
Tests, two really
17:39 - 17:42

great books, you know. The whole idea is to
17:42 - 17:45

you know start with your unit tests, then
write
17:45 - 17:46

your class, so.
17:46 - 17:48

It's good.
17:48 - 17:53

That was actually pretty quick. So does anyone
have
17:53 - 17:54

any questions?
17:55 - 17:59

I'm kind of done. Thanks.

Title:: Ruby Conf 2013 - How To Roll Your Own Ops Framework In Ruby (If You Really Have To)
Duration:: 18:24
