
Ruby Conf 2013 - How To Roll Your Own Ops Framework In Ruby (If You Really Have To)

  • 0:16 - 0:18
    SANDY VANDERBLEEK: OK. So I'm gonna talk
  • 0:19 - 0:23
    about how to write your own operations framework,
  • 0:23 - 0:26
    if you really have to. And that's the first
    question.
  • 0:26 - 0:30
    Do you really have to? No. You don't. Other
  • 0:30 - 0:34
    people have written lots of different things,
    and if
  • 0:34 - 0:37
    you're working in Ruby, Chef is pretty great
    and
  • 0:37 - 0:41
    it'll get you really far. So, and you know,
  • 0:41 - 0:43
    if you have to roll your own thing, it
  • 0:43 - 0:47
    doesn't mean you can't keep using Chef. We're
    using
  • 0:47 - 0:53
    Chef, and the problems we're trying to solve
    are-
  • 0:53 - 0:55
    We want something that is very testable. That
    is
  • 0:55 - 1:00
    our first key desirable. And we need to be
  • 1:00 - 1:03
    able to test things rapidly at the unit level
  • 1:03 - 1:07
    and then integrate it, and then the biggest
    area
  • 1:07 - 1:09
    of testing is acceptance testing.
  • 1:09 - 1:13
    We want to have our framework bring up real
  • 1:13 - 1:16
    machines and make assertions about the state
    of those
  • 1:16 - 1:20
    machines. And we want to know that our framework
  • 1:20 - 1:22
    can bring up the machines - you know, we
  • 1:22 - 1:26
    want continuous test running so we know we
    can
  • 1:26 - 1:28
    deploy our infrastructure.
  • 1:28 - 1:30
    So we know we can build our infrastructure
    at
  • 1:30 - 1:32
    all times, so at, you know, three in the
  • 1:32 - 1:35
    morning when things are going bad, we don't
    also
  • 1:35 - 1:40
    have to deal with debugging our deployment
    code. So
  • 1:40 - 1:42
    that is the main desirable.
  • 1:42 - 1:47
    Right, before we decided to write our own
    framework
  • 1:47 - 1:51
    at the company I'm working at, we've been
    using
  • 1:51 - 1:56
    CloudWatch on AWS. Not CloudWatch. CloudFormation
    templates. And
  • 1:56 - 2:00
    stuff like that. They're slow and it was hard,
  • 2:00 - 2:03
    they're hard to test. So just, designing this
    framework
  • 2:03 - 2:08
    with tests first is a big, big thing.
  • 2:08 - 2:11
    So, and then the, the rest of the standard
  • 2:11 - 2:13
    stuff. We need to be available. Everything
    for us
  • 2:13 - 2:16
    is based around high availability. Everything
    is pretty much
  • 2:16 - 2:20
    a cluster with a load balancer. So we have
  • 2:20 - 2:21
    no single point of failure.
  • 2:21 - 2:23
    And this has to apply to our, what's running
  • 2:23 - 2:27
    our, you know, operations framework, too.
    The, the servers
  • 2:27 - 2:30
    that our ops is running on need to be
  • 2:30 - 2:33
    high availa- highly available, too. So it
    needs to
  • 2:33 - 2:37
    bootstrap itself from a local machine into
    a cluster
  • 2:37 - 2:40
    mode, and run like that.
  • 2:40 - 2:45
    Debugability is big. When a deployment fails
    we need
  • 2:45 - 2:47
    to get on our machines, see why it fails,
  • 2:47 - 2:50
    see what's going on. The whole system is based
  • 2:50 - 2:52
    on the system of swapping out an already running
  • 2:52 - 2:55
    cluster with the new, newly deployed cluster.
    So, and
  • 2:55 - 2:58
    we, we want to keep that old cluster around,
  • 2:58 - 3:00
    if you can, I mean that'd be nice, in
  • 3:00 - 3:02
    case the new one has some problems - just
  • 3:02 - 3:04
    swap back.
  • 3:04 - 3:07
    So, also auditable is a big thing. We want
  • 3:07 - 3:13
    to know what's going on every transition of
    our
  • 3:13 - 3:16
    operations resource, we want to know why it
    triggered
  • 3:16 - 3:18
    and it wasn't successful. You know, how long
    has
  • 3:18 - 3:20
    it taken, we want to know average deploy times,
  • 3:20 - 3:25
    average fail times, et cetera. Want to be
    able
  • 3:25 - 3:27
    to run those analytics.
  • 3:27 - 3:31
    So what's wrong with just Chef? So Chef server
  • 3:31 - 3:32
    is what you would go to to kind of
  • 3:32 - 3:35
    provide this whole, you know, framework for
    managing your
  • 3:35 - 3:39
    settings, managing your machines, doing service
    discovery, stuff like
  • 3:39 - 3:46
    that. And to make that highly available is
    non-trivial.
  • 3:46 - 3:49
    And you know ChefSpec, the testing, you know
    that,
  • 3:49 - 3:50
    one of the testing libraries that is out there
  • 3:50 - 3:54
    for it, it's just pretty much a unit test.
  • 3:54 - 3:57
    It just doesn't actually do anything. It just
    tests
  • 3:57 - 4:00
    your code, and you know, can it run.
  • 4:00 - 4:02
    And there are some other tools out there,
    like
  • 4:02 - 4:09
    Cucumber-Chef, which comes from this book,
    Test-Driven Infrastructure
  • 4:09 - 4:13
    with Chef, and you get to write Cucumber
    tests
  • 4:13 - 4:17
    that say, like, real machines, it deploys
    actual AWS
  • 4:17 - 4:21
    resources and then runs your tests on those
    servers.
  • 4:21 - 4:25
    So that's a pretty cool tool. It's, it's not
  • 4:25 - 4:29
    really under active development right now,
    and it's not
  • 4:29 - 4:33
    quite flexible for what we wanted to do. And
  • 4:33 - 4:35
    something really cool that's coming out with
    the same
  • 4:35 - 4:37
    people who did ChefSpec is Test Kitchen.
  • 4:37 - 4:39
    I don't know if anyone's seen this, but it,
  • 4:39 - 4:42
    it definitely is what we're aiming to do from
  • 4:42 - 4:47
    the testing standpoint. So, but it's under
    wraps and
  • 4:47 - 4:48
    all that.
  • 4:48 - 4:51
    So here are the components that I came up
  • 4:51 - 4:54
    with for this framework. It's all API based.
    So
  • 4:54 - 4:58
    we have ops, basically, as an API service where,
  • 4:58 - 5:02
    and, just, developers can make API calls from
    a,
  • 5:02 - 5:06
    from Perl or whatever. Or there's also a front-end
  • 5:06 - 5:09
    component that I built in JavaScript.
  • 5:09 - 5:14
    So the API is like, you know, access, it's
  • 5:14 - 5:17
    just your, your single point of control. Then
    all
  • 5:17 - 5:20
    the, you know, our business logic is in the
  • 5:20 - 5:24
    domain. The domain layer. And that's things
    like, what
  • 5:24 - 5:28
    we consider part of our deployment process.
    We built
  • 5:28 - 5:33
    images, we deploy clusters, we have some settings
    and
  • 5:33 - 5:37
    users with permissions. So that's basically
    our domain.
  • 5:37 - 5:39
    And then the ops is like the whole meat
  • 5:39 - 5:42
    of it. This is all the nasty stuff, working
  • 5:42 - 5:46
    with your cloud library, working with, you
    know, Unix,
  • 5:46 - 5:48
    getting all of the things done that you need
  • 5:48 - 5:51
    to get done to make your domain a reality.
  • 5:51 - 5:56
    We have a database for persistence. We're
    using MongoDB
  • 5:56 - 6:01
    in a cluster. So the database isn't that important.
  • 6:01 - 6:04
    Whatever you're gonna use just needs to be
    reliable.
  • 6:04 - 6:07
    And then the frontend - it can be an
  • 6:07 - 6:12
    app, a command line, whatever. That's why
    you make
  • 6:12 - 6:15
    the API, so you have that flexibility at the
  • 6:15 - 6:16
    front-end level.
  • 6:16 - 6:20
    So the API - we have end points. It's
  • 6:20 - 6:24
    just Rack, so it's pretty simple. We're using
    Grape.
  • 6:24 - 6:29
    Grape is really nice for writing quick APIs
    in
  • 6:29 - 6:34
    Ruby. Grape has entities which map domain
    objects
  • 6:34 - 6:38
    to the JSON representation. We're just using JSON in and
    JSON and
  • 6:38 - 6:40
    JSON out.
  • 6:40 - 6:43
    So and then when we consume a representation
    that
  • 6:43 - 6:46
    a client has messed with, it's called a representation
  • 6:46 - 6:49
    right now which is not a great name. But
  • 6:49 - 6:52
    that's to take a representation and go back
    into
  • 6:52 - 6:58
    the domain layer from there. Then services
    kind of
  • 6:58 - 6:59
    act as the interface to the domain and ops
  • 6:59 - 7:03
    layer for the API, so it's not highly coupled
  • 7:03 - 7:05
    to what's going on there.
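As a rough illustration of the two directions described here (an entity turning a domain object into JSON, and a representation taking client-edited JSON back into the domain), here is a minimal sketch in plain Ruby. It does not use Grape. `ClusterResource`, `ClusterEntity`, `ClusterRepresentation`, and the list of editable fields are all hypothetical names chosen for the example, not the framework's actual classes.

```ruby
require "json"

# Hypothetical domain object; the framework's real resources are richer.
ClusterResource = Struct.new(:id, :name, :state)

# Entity: domain object -> JSON that goes out to the client.
class ClusterEntity
  def self.represent(cluster)
    JSON.generate({
      "id" => cluster.id,
      "name" => cluster.name,
      "state" => cluster.state.to_s
    })
  end
end

# Representation: client-edited JSON -> changes applied to the domain object.
class ClusterRepresentation
  # Assumption: only these fields may be changed by a client.
  ALLOWED = %w[name state].freeze

  def self.consume(json, cluster)
    attrs = JSON.parse(json).slice(*ALLOWED)
    cluster.name = attrs["name"] if attrs.key?("name")
    cluster.state = attrs["state"].to_sym if attrs.key?("state")
    cluster
  end
end

cluster = ClusterResource.new(1, "web", :up)
puts ClusterEntity.represent(cluster)
ClusterRepresentation.consume('{"name":"api","id":99}', cluster)
puts cluster.name
```

Unknown or protected keys such as `id` are ignored on the way in, so the client cannot change them by editing the JSON.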
  • 7:05 - 7:09
    And the clients, the API provides some clients,
    just
  • 7:09 - 7:13
    REST clients basically. Everything is RESTful.
    So there are,
  • 7:13 - 7:15
    there are a couple of client objects in the
  • 7:15 - 7:17
    API layer that you could use for a command
  • 7:17 - 7:20
    line inside the ops.
  • 7:20 - 7:25
    Wherever you need it. And execution is an
    abstraction
  • 7:25 - 7:28
    for the API to say go do something, I'm
  • 7:28 - 7:31
    gonna respond to the client, and some work
    is
  • 7:31 - 7:32
    gonna go on in the background.
  • 7:32 - 7:35
    And there's a, I extracted it because there's
    a
  • 7:35 - 7:38
    couple different ways, just for getting started.
    Just want
  • 7:38 - 7:42
    to fork, you know, but we use AWS Flow
  • 7:42 - 7:44
    a lot, to do a lot of our work,
  • 7:44 - 7:46
    so. I wanted to make it flexible enough so
  • 7:46 - 7:49
    that when we plug it into our flow infrastructure,
  • 7:49 - 7:54
    we can run our tasks like that.
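A minimal sketch of what a swappable execution abstraction could look like, assuming every execution exposes a `start` method that takes a block of work and returns a run ID right away. The talk mentions fork and AWS Flow as backends; this sketch swaps in a thread (plus an inline variant for tests) purely to stay self-contained. The class and method names are hypothetical.

```ruby
require "securerandom"

# Runs the work immediately in the caller; handy for unit tests.
class InlineExecution
  def start
    run_id = SecureRandom.uuid
    yield run_id
    run_id
  end
end

# Runs the work in a background thread and returns the run ID at once,
# so the API can respond to the client while the work continues.
class ThreadExecution
  def initialize
    @threads = {}
  end

  def start(&work)
    run_id = SecureRandom.uuid
    @threads[run_id] = Thread.new { work.call(run_id) }
    run_id
  end

  # Block until the given run has finished (used here only for the demo).
  def wait(run_id)
    @threads.fetch(run_id).join
  end
end

finished = []
execution = ThreadExecution.new
run_id = execution.start { |id| finished << id }
execution.wait(run_id)
puts finished == [run_id]
```

Because callers only depend on `start`, another backend could be plugged in later without touching the API layer.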
  • 7:54 - 7:56
    So at the domain layer we have resources which
  • 7:56 - 8:00
    have states and logic, and then there are
    provisioners,
  • 8:00 - 8:04
    which are state machines over these resources.
    So states,
  • 8:04 - 8:08
    for example, image has a pending state, a
    building
  • 8:08 - 8:12
    state, a built state, a destroyed state, et
    cetera.
  • 8:12 - 8:14
    And the provisioner is the state machine that's
    gonna
  • 8:14 - 8:18
    run through all those states, with transitions
    using success
  • 8:18 - 8:22
    and failure, and all this happens in the background,
  • 8:22 - 8:25
    usually when you ask for a resource to change
  • 8:25 - 8:27
    state. Provisioner is gonna determine what
    it needs to
  • 8:27 - 8:29
    do to change that state, and then go about
  • 8:29 - 8:30
    doing that.
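The idea of a resource with a fixed set of states can be sketched as a small hand-rolled state machine. This is an illustration only: the `TRANSITIONS` table, the `:failed` state, and the method names are assumptions, and the real framework uses a state machine DSL rather than code like this.

```ruby
# Hypothetical image resource with explicit, guarded state transitions.
class Image
  # Each state maps to the states it is allowed to move to.
  TRANSITIONS = {
    pending:   [:building, :failed],
    building:  [:built, :failed],
    built:     [:destroyed],
    failed:    [:pending, :destroyed],
    destroyed: []
  }.freeze

  attr_reader :state, :history

  def initialize
    @state = :pending
    @history = [:pending]
  end

  def can_transition_to?(target)
    TRANSITIONS.fetch(@state).include?(target)
  end

  # Move to the target state, or raise if the move is not allowed.
  def transition!(target)
    unless can_transition_to?(target)
      raise ArgumentError, "cannot go from #{@state} to #{target}"
    end
    @state = target
    @history << target
    self
  end
end

image = Image.new
image.transition!(:building).transition!(:built)
puts image.history.inspect
```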
  • 8:30 - 8:35
    And that uses the ops layer, where the providers
  • 8:35 - 8:38
    are basically controlled by the provisioner
    state machines.
  • 8:38 - 8:41
    They communicate using just success and failure
    and pass
  • 8:41 - 8:44
    an options hash. And then the ops has a
  • 8:44 - 8:47
    lot of, you know, tools to use the cloud
  • 8:47 - 8:49
    services, so we can get things done on the
  • 8:49 - 8:50
    cloud.
  • 8:50 - 8:52
    And very important are the testing tools to
    prove
  • 8:52 - 8:54
    that the things actually got done on the cloud.
  • 8:54 - 8:57
    We want to know processes are running, files,
    directories,
  • 8:57 - 9:02
    everything is set up. Everything is good.
    OK.
  • 9:02 - 9:04
    And the database. It's just a database. We
    use
  • 9:04 - 9:07
    it to store data. There are mappers that map
  • 9:07 - 9:11
    the resources to MongoDB and back. It uses
    the
  • 9:11 - 9:15
    data mapper pattern. Perpetuity is a cool
    gem. Right
  • 9:15 - 9:17
    now it just works with Mongo, but they're
    adding
  • 9:17 - 9:20
    PostgreSQL support to it.
  • 9:20 - 9:23
    So I also looked into ROM (rom-rb), which is
  • 9:23 - 9:28
    pretty cool, but definitely not ready for
    use. Tried
  • 9:28 - 9:34
    to keep the, you know, model persistence out
    of
  • 9:34 - 9:39
    the domain layer, you know, not ActiveRecord
    style. Data
  • 9:39 - 9:42
    mapper is a, is the pattern. It's in Patterns
  • 9:42 - 9:46
    of Enterprise Application Architecture. If you haven't
    heard of it,
  • 9:46 - 9:48
    definitely check it out.
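For readers who have not met the data mapper pattern, here is a minimal sketch of the separation it gives: the domain object knows nothing about storage, and a separate mapper translates it to and from stored documents. An in-memory Hash stands in for a MongoDB collection, and `Cluster` and `ClusterMapper` are hypothetical names. This is not the Perpetuity API.

```ruby
# Domain object: no persistence code in here.
class Cluster
  attr_reader :name, :state
  attr_accessor :id

  def initialize(name:, state: :pending, id: nil)
    @name = name
    @state = state
    @id = id
  end
end

# Mapper: the only place that knows how a Cluster is stored.
class ClusterMapper
  def initialize(store = {})
    @store = store # stands in for a database collection
    @next_id = 0
  end

  def insert(cluster)
    cluster.id = (@next_id += 1)
    @store[cluster.id] = { "name" => cluster.name, "state" => cluster.state.to_s }
    cluster
  end

  # Returns a fresh Cluster built from the stored document, or nil.
  def find(id)
    doc = @store[id]
    doc && Cluster.new(id: id, name: doc["name"], state: doc["state"].to_sym)
  end
end

mapper = ClusterMapper.new
saved = mapper.insert(Cluster.new(name: "web"))
puts mapper.find(saved.id).name
```

Contrast this with the ActiveRecord style, where the `Cluster` class itself would carry `save` and `find`.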
  • 9:48 - 9:53
    So the resources transition between states.
    Transitions are also
  • 9:53 - 9:56
    resources. This is part of the auditability.
    You want
  • 9:56 - 10:01
    to know, you know, every transition. So our
    resources
  • 10:01 - 10:05
    are images, clusters, settings, users, permissions,
    right now. It's
  • 10:05 - 10:08
    pretty simple.
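Treating every transition as a stored record is what makes the analytics mentioned earlier (average deploy time, average fail time) possible. A minimal sketch, assuming each transition records its outcome and start and finish times in seconds; `Transition` and `AuditLog` are hypothetical names.

```ruby
# One record per state transition of a resource.
Transition = Struct.new(:resource, :from, :to, :event, :started_at, :finished_at) do
  def duration
    finished_at - started_at
  end
end

class AuditLog
  def initialize
    @transitions = []
  end

  def record(transition)
    @transitions << transition
    self
  end

  # Average duration of all transitions that ended with the given event
  # (:success or :failure); nil when there are none.
  def average_duration(event)
    matching = @transitions.select { |t| t.event == event }
    return nil if matching.empty?
    matching.sum(&:duration) / matching.size.to_f
  end
end

log = AuditLog.new
log.record(Transition.new("cluster-1", :pending, :up, :success, 0, 100))
log.record(Transition.new("cluster-2", :pending, :up, :success, 0, 200))
puts log.average_duration(:success)
```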
  • 10:08 - 10:14
    The providers are the implementations of each
    resource state.
  • 10:14 - 10:16
    So this is in the ops layer. And you
  • 10:16 - 10:18
    write a provider, you'll write, like a method
    called
  • 10:18 - 10:22
    build, if you're an image provider, and then
    you'll have
  • 10:22 - 10:27
    like a method pending, build pending, and
    just, that
  • 10:27 - 10:30
    method is called when that resource is gonna
    go
  • 10:30 - 10:33
    into that state. So you need to do everything
  • 10:33 - 10:35
    that will make that resource in the state,
    and
  • 10:35 - 10:38
    then say success inside the provider if you,
    you
  • 10:38 - 10:40
    know, if you achieved it.
  • 10:40 - 10:43
    And then the provisioner will actually update
    the client
  • 10:43 - 10:46
    and let the resource know that it is in
  • 10:46 - 10:51
    that state at the API level. So the provisioner
  • 10:51 - 10:54
    is just the control object. It knows about
    the
  • 10:54 - 10:57
    client and the only transition events are
    success and
  • 10:57 - 10:58
    failure.
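The split between provider and provisioner can be sketched as follows: the provider has one method per target state and reports only success or failure with an options hash, while the provisioner is the control object that calls those methods in order and records the resulting state. All names here (`Result`, `ImageProvider`, `Provisioner`, the `:failed` state) are hypothetical, and the provider bodies are placeholders for real cloud work.

```ruby
# The only two events a provider can report, plus an options hash.
Result = Struct.new(:event, :options)

class ImageProvider
  # Called when the image should enter the building state.
  def building(options)
    return failure({ reason: "no base image" }) unless options[:base_image]
    success({ image_id: "img-from-#{options[:base_image]}" })
  end

  # Called when the image should enter the built state.
  def built(options)
    success({ registered: true, image_id: options[:image_id] })
  end

  private

  def success(options = {})
    Result.new(:success, options)
  end

  def failure(options = {})
    Result.new(:failure, options)
  end
end

class Provisioner
  attr_reader :state, :log

  def initialize(provider, steps)
    @provider = provider
    @steps = steps
    @state = :pending
    @log = []
  end

  # Walk the target states; stop at the first failure.
  def run(options = {})
    @steps.each do |target|
      result = @provider.public_send(target, options)
      @log << [target, result.event]
      if result.event == :failure
        @state = :failed
        return result
      end
      @state = target
      options = options.merge(result.options)
    end
    Result.new(:success, options)
  end
end

provisioner = Provisioner.new(ImageProvider.new, [:building, :built])
provisioner.run({ base_image: "ubuntu" })
puts provisioner.state
```

Keeping the provider ignorant of the client and of the state machine is what lets it be unit tested on its own.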
  • 10:58 - 11:03
    So it runs inside an execution with the run
  • 11:03 - 11:06
    ID as the transition. So a nice feature that
  • 11:06 - 11:08
    we don't have yet will be to take that
  • 11:08 - 11:14
    run ID and cancel, cancel transitions.
  • 11:14 - 11:17
    So an example flow, for images for us
    is,
  • 11:17 - 11:20
    we start in pending, we set up our, our
  • 11:20 - 11:24
    image on AWS. We go into a build_pending and
  • 11:24 - 11:28
    we, you know, run our, we install Ruby on
  • 11:28 - 11:29
    it. We do everything we want to have the
  • 11:29 - 11:34
    image set up. It takes awhile. And, well that's
  • 11:34 - 11:36
    actually the building state, sorry.
  • 11:36 - 11:38
    And then when it's built, we, you know, make
  • 11:38 - 11:42
    sure it's registered properly and everything
    like that. So
  • 11:42 - 11:45
    this framework lets us think in terms of state
  • 11:45 - 11:47
    machines, which I think is really valuable.
    Think about
  • 11:47 - 11:51
    state transitions and think about all your
    operations resources
  • 11:51 - 11:53
    - all your operations, you know, things as
    resources
  • 11:53 - 11:55
    that have states that are gonna go through
    state
  • 11:55 - 12:00
    transitions as you, you
    know, deploy
  • 12:00 - 12:03
    things, make things happen.
  • 12:03 - 12:05
    So how do I make it, you know, more
  • 12:05 - 12:08
    of a framework, something reusable for everybody?
  • 12:08 - 12:11
    So right now it's, it's kind of hard. You
  • 12:11 - 12:13
    build your own subclass of resource, provisioner,
    and provider
  • 12:13 - 12:17
    for, you know, your domain object, something
    you want,
  • 12:17 - 12:19
    you know, to act as an operations resource
    for
  • 12:19 - 12:22
    you. And then you also have to do your
  • 12:22 - 12:25
    entity, endpoint, and service. And write the
    mapper for
  • 12:25 - 12:26
    the database.
  • 12:26 - 12:30
    And so it's really like seven, at least seven
  • 12:30 - 12:34
    classes you're gonna create to make an operations
    resource.
  • 12:34 - 12:36
    So that's pretty hard.
  • 12:36 - 12:40
    I've looked into trying to make a DSL to
  • 12:40 - 12:44
    build the resource and provisioner, because
    they're very related.
  • 12:44 - 12:48
    It's basically the states and the state machine.
    But
  • 12:48 - 12:50
    the state machine is already - I'm using
  • 12:50 - 12:54
    Workflow, it's already a state machine DSL.
    So, it's
  • 12:54 - 12:58
    hard to, to make frameworks on top of frameworks
  • 12:58 - 13:02
    sometimes. And lots of class-level metaprogramming.
  • 13:02 - 13:03
    So.
  • 13:03 - 13:06
    But it's interesting, and definitely the
    roadmap is
  • 13:06 - 13:10
    to do that. There's already a DSL for the
  • 13:10 - 13:13
    API. It's Grape. And the frontend - I made
  • 13:13 - 13:16
    some interesting decisions cause I'm a former
    front end
  • 13:16 - 13:20
    developer. So I'm using Ember.js to just work
    with
  • 13:20 - 13:24
    JSON. There's no frontend server. It's a static
    JavaScript
  • 13:24 - 13:29
    app, self-contained. So would people be interested
    in that?
  • 13:29 - 13:33
    It's kind of crazy. Also Emblem is a templating
  • 13:33 - 13:37
    language, and it's like Haml with Handlebars.
    It's kind
  • 13:37 - 13:40
    of cool. Some cool stuff. So the goal is
  • 13:40 - 13:44
    definitely to open source, you know, the work
    we've done,
  • 13:44 - 13:48
    and of course, profit. So what are our key
  • 13:48 - 13:50
    process benefits from this?
  • 13:50 - 13:52
    So we write acceptance tests using RSpec
    matchers. They
  • 13:52 - 13:55
    run on the instances created by the API, that
  • 13:55 - 13:59
    is really big. Cause we, we could have had
  • 13:59 - 14:02
    a broken deployment for weeks, and had no
    idea
  • 14:02 - 14:06
    previously. You know, cause we weren't constantly
    testing our
  • 14:06 - 14:09
    deployment infrastructure.
  • 14:09 - 14:11
    So when a deployment fails, we have SSH to
  • 14:11 - 14:13
    access the machine. We have a one stop shop
  • 14:13 - 14:18
    for settings and service discovery. Failover
    is a
  • 14:18 - 14:22
    fundamental construct. Swapping clusters back
    and forth.
  • 14:22 - 14:25
    And it is self-documenting, which is pretty
    cool, using
  • 14:25 - 14:29
    Grape. You write a couple descriptions of
    your end
  • 14:29 - 14:32
    points, of your, you know, gets and puts,
    and
  • 14:32 - 14:36
    then I made a, an endpoint to represent the
  • 14:36 - 14:38
    endpoints, actually. So there's an entity
    for the endpoint.
  • 14:38 - 14:41
    So the actual API endpoints can be output
    as
  • 14:41 - 14:44
    JSON representations, and then you can ask
    for documentation
  • 14:44 - 14:46
    as JSON.
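The "endpoint that represents the endpoints" idea can be illustrated without Grape: keep a short description next to each route and expose the whole list as JSON. This is only a sketch of the concept. `EndpointCatalog`, `describe`, `documentation_json`, and the example paths are hypothetical and are not Grape's API.

```ruby
require "json"

# Collects a one-line description per endpoint and serves them as JSON.
class EndpointCatalog
  def initialize
    @endpoints = []
  end

  def describe(verb, path, description)
    @endpoints << {
      "verb" => verb.to_s.upcase,
      "path" => path,
      "description" => description
    }
    self
  end

  # What a documentation endpoint could return to a client.
  def documentation_json
    JSON.generate(@endpoints)
  end
end

catalog = EndpointCatalog.new
catalog.describe(:get, "/clusters", "List clusters and their states")
catalog.describe(:put, "/clusters/:id", "Request a state change for a cluster")
puts catalog.documentation_json
```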
  • 14:46 - 14:50
    So that was pretty cool. But, yeah, it's not
  • 14:50 - 14:53
    done yet. Lots of work. And right now it's
  • 14:53 - 14:55
    kind of monolithic, which is a little bit of
    a
  • 14:55 - 14:58
    problem because we need to integrate lots
    of different
  • 14:58 - 15:02
    tools that ops developers are building, and
    they don't
  • 15:02 - 15:05
    all fit or, you know, they already work, and
  • 15:05 - 15:08
    how do we fit them into this provisioner or
  • 15:08 - 15:10
    provider model. It's a little heavyweight
    for some
  • 15:10 - 15:12
    lightweight tools.
  • 15:12 - 15:16
    So I'm gonna show a couple examples of the
  • 15:16 - 15:18
    frontend.
  • 15:18 - 15:23
    So Bootstrap 3. It's very nice and clean.
    We
  • 15:23 - 15:27
    have the resource states on the clusters: pending,
    down,
  • 15:27 - 15:30
    up, some actions and we have a little menu
  • 15:30 - 15:34
    to go through our deployments. Some of these,
    these
  • 15:34 - 15:37
    are just mocks, basically. The whole thing's
    not working.
  • 15:37 - 15:41
    We want monitors eventually. We get paged
    a lot
  • 15:41 - 15:44
    and sometimes we don't know if, well if we
  • 15:44 - 15:46
    got paged or if it was just transient.
  • 15:46 - 15:47
    So we want a page to look at really
  • 15:47 - 15:51
    quick, you know, just to see if, basically
    a
  • 15:51 - 15:53
    sanity check to see if, do we really need
  • 15:53 - 15:54
    to, you know, get up at three in the
  • 15:54 - 15:57
    morning, get on the computer and get on these
  • 15:57 - 15:59
    servers and see what's up.
  • 15:59 - 16:02
    So, oh yeah. And the nice thing about using
  • 16:02 - 16:04
    this with Bootstrap is it's definitely gonna
    work on
  • 16:04 - 16:09
    a mobile phone, so you'll be able to look at
  • 16:09 - 16:14
    it on your phone in bed. Yeah.
  • 16:14 - 16:16
    And this will change our life. This is a
  • 16:16 - 16:18
    big pain point for us is how we manage
  • 16:18 - 16:21
    our settings. We run Chef Solo, right now
    we
  • 16:21 - 16:24
    don't use Chef Server, and our settings
    are in
  • 16:24 - 16:27
    a bunch of S3 buckets. We have rigged tools
  • 16:27 - 16:29
    you know to update all our buckets, but it's
  • 16:29 - 16:31
    definitely not the easiest to visualize. So.
  • 16:31 - 16:35
    It's gonna help us a lot.
  • 16:35 - 16:38
    So here's some of the, some of my inspiration
  • 16:38 - 16:42
    while doing this. Test-Driven Infrastructure
    with Chef. It's a
  • 16:42 - 16:45
    really quick read, it's like 70 pages. If
    you're
  • 16:45 - 16:49
    interested in, you know, testing, your deployment
    process, check
  • 16:49 - 16:52
    that out. It's kind of hand wave-y, but, there
  • 16:52 - 16:55
    is the code, there is the Chef, Cucumber-Chef,
  • 16:55 - 16:57
    That the guy who wrote the book wrote, so
  • 16:57 - 16:59
    you can check that out too.
  • 16:59 - 17:03
    DevOps Weekly is a great, great newsletter.
    I pretty
  • 17:03 - 17:05
    much read it every week. They bring up some really
  • 17:05 - 17:09
    cool tools and things people are working on.
    Just
  • 17:09 - 17:12
    Release It! is a very cool book in the
  • 17:12 - 17:17
    Pragmatic Programmers series or whoever releases
    that. It's, it's
  • 17:17 - 17:20
    Java based, but it's, it's all about, you
    know,
  • 17:20 - 17:24
    handling failure, and how important failure
    is as a concept
  • 17:24 - 17:25
    to operations.
  • 17:25 - 17:28
    And, of course, you know, when you're building
    a
  • 17:28 - 17:31
    framework and you're really trying to find
    these, this
  • 17:31 - 17:36
    structure, Patterns of Enterprise Application Architecture
    is a classic, and
  • 17:36 - 17:39
    Growing Object-Oriented Software, Guided by
    Tests, two really
  • 17:39 - 17:42
    great books, you know. The whole idea is to
  • 17:42 - 17:45
    you know start with your unit tests, then
    write
  • 17:45 - 17:46
    your class, so.
  • 17:46 - 17:48
    It's good.
  • 17:48 - 17:53
    That was actually pretty quick. So does anyone
    have
  • 17:53 - 17:54
    any questions?
  • 17:55 - 17:59
    I'm kind of done. Thanks.
Duration:
18:24
