
RailsConf 2014 - Eliminating Inconsistent Test Failures by Austin Putman

  • 0:17 - 0:20
    AUSTIN PUTMAN: This is the last session before
    happy hour.
  • 0:20 - 0:25
    I appreciate all of you for hanging around
  • 0:25 - 0:28
    this long. Maybe you're here because,
  • 0:28 - 0:29
    you don't know, there's a bar on the
  • 0:29 - 0:32
    first floor of this hotel. I think that
  • 0:32 - 0:35
    is where the main track is currently taking
    place.
  • 0:35 - 0:38
    I am Austin Putman. I am the VP of
  • 0:38 - 0:41
    engineering for Omada Health. At Omada, we
    support people
  • 0:41 - 0:45
    at risk of chronic disease, like diabetes,
    make crucial
  • 0:45 - 0:49
    behavior changes and have longer, healthier
    lives. So, it's
  • 0:49 - 0:50
    pretty awesome.
  • 0:50 - 0:53
    I'm gonna start with some spoilers, because
    I want
  • 0:53 - 0:55
    you to have an amazing RailsConf. So if this
  • 0:55 - 0:57
    is not what you're looking for, don't be shy
  • 0:57 - 1:01
    about finding that bar track. We're gonna
    spend some
  • 1:01 - 1:05
    quality time with Capybara and Cucumber, whose
    flakiness is
  • 1:05 - 1:07
    legendary, for very good reasons.
  • 1:07 - 1:11
    Let me take your temperature. Can I see hands?
  • 1:11 - 1:13
    How many people have had problems with random
    failures
  • 1:13 - 1:18
    in Cucumber or Capybara? Yeah. Yeah. This
    is reality,
  • 1:18 - 1:19
    folks.
  • 1:19 - 1:22
    We're also gonna cover the ways that Rspec
    does
  • 1:22 - 1:25
    and does not help us track down test pollution.
  • 1:25 - 1:27
    How many folks out there have had a random
  • 1:27 - 1:29
    failure problem in the Rspec suite, like in
    your
  • 1:29 - 1:32
    models or your controller tests? OK, still
    a lot
  • 1:32 - 1:35
    of people, right. It happens. But we don't
    talk
  • 1:35 - 1:37
    about it.
  • 1:37 - 1:40
    So in between, we're gonna review some problems
    that
  • 1:40 - 1:43
    can dog any test suite. This is like, random
  • 1:43 - 1:47
    data, time zone heck, external dependencies.
    All this leads
  • 1:47 - 1:48
    to pain. There was a great talk before about
  • 1:48 - 1:50
    external dependencies.
  • 1:50 - 1:53
    Just, here's just a random one. How many people
  • 1:53 - 1:54
    here have had a test fail due to a
  • 1:54 - 2:00
    daylight savings time issue? Yeah. Ben Franklin,
    you are
  • 2:00 - 2:02
    a menace.
  • 2:02 - 2:06
    Let's talk about eliminating inconsistent
    failures in your tests,
  • 2:06 - 2:10
    and on our team, we call that fighting randos.
  • 2:10 - 2:12
    And I'm here to talk about this, because I
  • 2:12 - 2:16
    was stupid and short-sighted, and random failures
    caused us
  • 2:16 - 2:18
    a lot of pain. I chose to try to
  • 2:18 - 2:22
    hit deadlines instead of focusing on build
    quality, and
  • 2:22 - 2:24
    our team paid a terrible price.
  • 2:24 - 2:26
    Anybody out there paying that price? Anybody
    out there
  • 2:26 - 2:30
    feel me on this? Yeah. It's, it sucks.
  • 2:30 - 2:32
    So let's do some science. Some projects seem
    to
  • 2:32 - 2:35
    have more random failure problems than others.
    I want
  • 2:35 - 2:37
    to gather some data. So first, if you write
  • 2:37 - 2:41
    tests on a regular basis, raise your hand.
    Right?
  • 2:41 - 2:45
    Wow. I love RailsConf. Keep your hand up if
  • 2:45 - 2:48
    you believe you have experienced a random
    test failure.
  • 2:48 - 2:50
    The whole room.
  • 2:50 - 2:52
    Now, if you think you're likely to have one
  • 2:52 - 2:56
    in the next, like, four weeks. Who's out there?
  • 2:56 - 2:58
    It's still happening, right. You're in the
    middle of
  • 2:58 - 3:01
    it. OK, so this is not hypothetical for this
  • 3:01 - 3:03
    audience. This is a widespread problem. But
    I don't
  • 3:03 - 3:05
    see a lot of people talking about it.
  • 3:05 - 3:08
    And the truth is, while being a great tool,
  • 3:08 - 3:12
    a comprehensive integration suite is like
    a breeding ground
  • 3:12 - 3:14
    for baffling Heisenbugs.
  • 3:14 - 3:18
    So, to understand how test failures become
    a chronic
  • 3:18 - 3:21
    productivity blocker, I want to talk a little
    bit
  • 3:21 - 3:25
    about testing culture, right. Why is this
    even bad?
  • 3:25 - 3:28
    So, we have an automated CI machine that runs
  • 3:28 - 3:31
    our full test suite every time a commit is
  • 3:31 - 3:34
    pushed. And every time the build passes, we
    push
  • 3:34 - 3:37
    the new code to a staging environment for
    acceptance.
  • 3:37 - 3:39
    Right, that's our process. How many people
    out there
  • 3:39 - 3:41
    have a setup that's kind of like that? OK.
  • 3:41 - 3:43
    Awesome. So a lot of people know what I'm
  • 3:43 - 3:44
    talking about.
  • 3:44 - 3:47
    So, in the fall of 2012, we started seeing
  • 3:47 - 3:52
    occasional, unreproducible failures of the
    test suite in Jenkins.
  • 3:52 - 3:54
    And we were pushing to get features out the
  • 3:54 - 3:59
    door for January first. And we found that
    we
  • 3:59 - 4:01
    could just rerun the build and the failure
    would
  • 4:01 - 4:03
    go away.
  • 4:03 - 4:06
    And we got pretty good at spotting the two
  • 4:06 - 4:09
    or three tests where this happened. So, we
    would
  • 4:09 - 4:12
    check the output of a failed build, and if
  • 4:12 - 4:15
    it was one of the suspect tests, we would
  • 4:15 - 4:18
    just run the build again. Not a problem. Staging
  • 4:18 - 4:21
    would deploy. We would continue our march
    towards the
  • 4:21 - 4:23
    launch.
  • 4:23 - 4:24
    But by the time spring rolled around, there
    were
  • 4:24 - 4:28
    like seven or eight places causing problems
    regularly. And
  • 4:28 - 4:30
    we would try to fix them, you know, we
  • 4:30 - 4:33
    wouldn't ignore them. But the failures were
    unreliable. So
  • 4:33 - 4:35
    it was hard to say if we had actually
  • 4:35 - 4:37
    fixed anything.
  • 4:37 - 4:39
    And eventually we just added a gem called
    Cucumber-Rerun.
  • 4:39 - 4:46
    Yeah. And this just reruns the failed specs
    if
  • 4:46 - 4:49
    there's a problem. And when it passed the
    second
  • 4:49 - 4:54
    time, it's good. You're fine. No big deal.
  • 4:54 - 4:56
    And then some people on our team got ambitious,
  • 4:56 - 4:58
    and they said, we could make it faster. We
  • 4:58 - 5:02
    could make CI faster with the parallel_tests
    gem, which
  • 5:02 - 5:07
    is awesome. But Cucumber-Rerun and parallel_tests
    are not compatible.
  • 5:07 - 5:09
    And so we had a test suite that ran
  • 5:09 - 5:13
    three times faster, but failed twice as often.
  • 5:13 - 5:16
    And as we came into the fall, we had
  • 5:16 - 5:21
    our first Bad Jenkins week. On a fateful Tuesday,
  • 5:21 - 5:24
    4 PM, the build just stopped passing. And
    there
  • 5:24 - 5:27
    were anywhere from like thirty to seventy
    failures. And
  • 5:27 - 5:30
    some of them were our usual suspects, and
    dozens
  • 5:30 - 5:34
    of them were, like, previously good tests.
    Tests we
  • 5:34 - 5:35
    trusted.
  • 5:35 - 5:37
    And, so none of them failed in isolation,
    right.
  • 5:37 - 5:40
    And after like two days of working on this,
  • 5:40 - 5:43
    we eventually got a clean Rspec build, but
    Cucumber
  • 5:43 - 5:45
    would still fail. And the failures could not
    be
  • 5:45 - 5:48
    reproduced on a dev machine, or even on the
  • 5:48 - 5:51
    same CI machine, outside of the, the whole
    build
  • 5:51 - 5:52
    running.
  • 5:52 - 5:55
    So, over the weekend, somebody pushes a commit
    and
  • 5:55 - 5:59
    we get a green build. And there's nothing
    special
  • 5:59 - 6:01
    about this commit, right. Like, it was like,
    a
  • 6:01 - 6:05
    comment change. And we had tried a million
    things,
  • 6:05 - 6:08
    and no single change obviously led to the
    passing
  • 6:08 - 6:09
    build.
  • 6:09 - 6:11
    And the next week, we were back to like,
  • 6:11 - 6:15
    you know, fifteen percent failure rate. Like,
    pretty good.
  • 6:15 - 6:19
    So, we could push stories to staging again,
    and
  • 6:19 - 6:23
    we're still under the deadline pressure, right.
    So, so
  • 6:23 - 6:28
    we shrugged. And we moved on, right. And maybe
  • 6:28 - 6:32
    somebody wants to guess, what happened next?
    Right?
  • 6:32 - 6:35
    Yeah. It happened again, right. A whole week
    of
  • 6:35 - 6:38
    just no tests pass. The build never passes.
    So
  • 6:38 - 6:40
    we turned off parallel_tests, right. Because
    we can't even
  • 6:40 - 6:42
    get like a coherent log of which tests are
  • 6:42 - 6:46
    causing errors, and then we started commenting
    out the
  • 6:46 - 6:50
    really problematic tests, and there were still
    these like
  • 6:50 - 6:55
    seemingly innocuous specs that failed regularly
    but not consistently.
  • 6:55 - 6:57
    So these are tests that have enough business
    value
  • 6:57 - 7:01
    that we are very reluctant to just, like,
    delete
  • 7:01 - 7:01
    them.
  • 7:01 - 7:08
    And so we reinstated Cucumber-Rerun, and its
    buddy Rspec-Rerun.
  • 7:08 - 7:13
    And this mostly worked, right. So we were
    making
  • 7:13 - 7:15
    progress. But the build issues continued to
    show up
  • 7:15 - 7:19
    in the negative column in our retrospectives.
    And that
  • 7:19 - 7:21
    was because there were several problems with
    this situation,
  • 7:21 - 7:26
    right. Like, reduced trust. When build failures
    happen four
  • 7:26 - 7:28
    or five times a day, those aren't a red
  • 7:28 - 7:30
    flag. Those are just how things go. And everyone
  • 7:30 - 7:33
    on the team knows that the most likely explanation
  • 7:33 - 7:36
    is a random failure.
  • 7:36 - 7:38
    And the default response to a build failure
    becomes,
  • 7:38 - 7:42
    run it again. So, just run it again, right.
  • 7:42 - 7:48
    The build failed. Whatever. So then, occasionally,
    we break
  • 7:48 - 7:50
    things for real. But we stopped noticing because
    we
  • 7:50 - 7:54
    started expecting CI to be broken. Sometimes
    other pairs
  • 7:54 - 7:56
    would pull the code and they would see the
  • 7:56 - 8:00
    legitimate failures. Sometimes we thought
    we were having a
  • 8:00 - 8:02
    Bad Jenkins week, and on the third or fourth
  • 8:02 - 8:06
    day we realized we were having actual failures.
  • 8:06 - 8:08
    This is pretty bad, right.
  • 8:08 - 8:12
    So our system depends on green builds to mark
  • 8:12 - 8:14
    the code that can be deployed to staging and
  • 8:14 - 8:17
    production, and without green builds, stories
    can't get delivered
  • 8:17 - 8:22
    and reviewed. So we stopped getting timely
    feedback. Meanwhile,
  • 8:22 - 8:25
    the reviewer gets, like, a week's worth of
    stories.
  • 8:25 - 8:26
    All at once. Big clump.
  • 8:26 - 8:28
    And that means they have less time to pay
  • 8:28 - 8:32
    attention to detail on each delivered feature.
    And that
  • 8:32 - 8:35
    means that the product is a little bit crappier
  • 8:35 - 8:40
    every week. So, maybe you need a bug fix.
  • 8:40 - 8:43
    Fast. Forget about that. You've got, like,
    a twenty
  • 8:43 - 8:44
    percent chance your bug fix build is gonna
    fail
  • 8:44 - 8:46
    for no reason.
  • 8:46 - 8:48
    Maybe the code has to ship, because the app
  • 8:48 - 8:52
    is mega busted. In this case, we would rerun
  • 8:52 - 8:54
    the failed tests on our local machine, and
    then
  • 8:54 - 8:59
    cross our fingers and deploy. So, in effect,
    our
  • 8:59 - 9:03
    policy was, if the code works on my machine,
  • 9:03 - 9:07
    it can be deployed to production.
  • 9:07 - 9:11
    So. At the most extreme, people lose faith
    in
  • 9:11 - 9:14
    the build, and eventually they just forget
    about testing.
  • 9:14 - 9:16
    And this didn't happen to us, but I had
  • 9:16 - 9:18
    to explain to management that key features
    couldn't be
  • 9:18 - 9:22
    shipped because of problems with the test
    server. And
  • 9:22 - 9:24
    they wanted to know a lot more about the
  • 9:24 - 9:27
    test server. And it was totally clear that
    while
  • 9:27 - 9:30
    a working test server has their full support,
    an
  • 9:30 - 9:33
    unreliable test server is a business liability
    and needs
  • 9:33 - 9:35
    to be resolved.
  • 9:35 - 9:38
    So, the test server is supposed to solve problems
  • 9:38 - 9:40
    for us, and that is the only story that
  • 9:40 - 9:43
    I like to tell about it. So, we began
  • 9:43 - 9:48
    to fight back. And we personified the random
    failures.
  • 9:48 - 9:53
    They became randos. A rando attack. A rando
    storm.
  • 9:53 - 10:00
    And most memorably, Rando Backstabbian. Intergalactic
    randomness villain.
  • 10:02 - 10:03
    We had a pair working on the test suite
  • 10:03 - 10:05
    full time for about three months trying to
    resolve
  • 10:05 - 10:08
    these issues. We tried about a thousand things,
    and
  • 10:08 - 10:10
    some of them worked. And I'm gonna pass along
  • 10:10 - 10:12
    the answers we found, and my hypothesis that
    we
  • 10:12 - 10:17
    didn't disprove. Honestly, I'm hoping that
    you came to
  • 10:17 - 10:19
    this talk because you've had similar problems
    and you
  • 10:19 - 10:23
    found better solutions. So, this is just what
    we
  • 10:23 - 10:24
    found.
  • 10:24 - 10:26
    I, I have a very important tool for this
  • 10:26 - 10:29
    section of the talk. It's the finger of blame.
  • 10:29 - 10:31
    We use this a lot when we were like,
  • 10:31 - 10:35
    hey, could the problem be Cucumber? And then
    we
  • 10:35 - 10:38
    would go after that. So here comes finger
    of
  • 10:38 - 10:39
    blame.
  • 10:39 - 10:46
    Cucumber! Capybara. Poltergeist. Definitely
    part of the problem. I've
  • 10:46 - 10:49
    talked to enough other teams that use these
    tools
  • 10:49 - 10:52
    extensively, and seen the evidence from our audience,
    to
  • 10:52 - 10:54
    know that the results are just not as deterministic
  • 10:54 - 10:57
    as we want. When you're using multiple threads
    and
  • 10:57 - 11:01
    you're asserting against a browser environment,
    you're gonna have
  • 11:01 - 11:02
    some issues, right.
  • 11:02 - 11:07
    And one of those is browser environment, right.
    Browser
  • 11:07 - 11:10
    environment is a euphemism for, like, a complicated
    piece
  • 11:10 - 11:13
    of software that itself is a playground for
    network
  • 11:13 - 11:17
    latency issues and rendering hiccups and a
    callback soup.
  • 11:17 - 11:19
    So your tests have to be written in a
  • 11:19 - 11:21
    very specific way to prevent all the threads
    and
  • 11:21 - 11:23
    all the different layers of code from getting
    confused
  • 11:23 - 11:25
    and smashing into each other.
  • 11:25 - 11:27
    You know, some of you maybe are lucky, and
  • 11:27 - 11:29
    you use the right style most of the time
  • 11:29 - 11:32
    by default. Maybe you don't see that many
    problems.
  • 11:32 - 11:35
    A few things you gotta never assume.
  • 11:35 - 11:38
    Never assume the page has loaded. Never assume
    the
  • 11:38 - 11:42
    markup you are asserting against exists. Never
    assume your
  • 11:42 - 11:46
    AJAX request actually finished, and never
    assume the speed
  • 11:46 - 11:49
    at which things happen, because until you
    bolt it
  • 11:49 - 11:52
    down, you just don't know.
  • 11:52 - 11:57
    So, always make sure the markup exists before
    you
  • 11:57 - 12:00
    assert against it. New Capybara is supposed
    to be
  • 12:00 - 12:03
    better at this, and it, it's improved. But
    I
  • 12:03 - 12:06
    do not trust them. I am super paranoid about
  • 12:06 - 12:09
    this stuff. This is a good example of a
  • 12:09 - 12:12
    lurking rando, due to a race condition, in
    your
  • 12:12 - 12:13
    browser.
  • 12:13 - 12:14
    Capybara is supposed to wait for the page
    to
  • 12:14 - 12:17
    load before it continues after the visit method,
    but
  • 12:17 - 12:20
    I find it has sort of medium success with
  • 12:20 - 12:24
    doing that. Bolt it down, right. We used to
  • 12:24 - 12:26
    have something called the wait_until block,
    and that would
  • 12:26 - 12:28
    stop execution until a condition was met.
    And that
  • 12:28 - 12:31
    was great. Cause it replaced, like, sleep
    statements, which
  • 12:31 - 12:33
    is what we used before that.
  • 12:33 - 12:37
    Modern Capybara, no more wait_until block.
    It's inside the
  • 12:37 - 12:42
    have_css and have_content matchers. So, always
    assert that something
  • 12:42 - 12:46
    exists before you try to do anything with
    it.
  • 12:46 - 12:49
    And sometimes it might take a long time. The
  • 12:49 - 12:52
    default timeout for that, for those Capybara
    assertions, is
  • 12:52 - 12:56
    like five seconds. And sometimes, you need
    twenty seconds.
  • 12:56 - 12:58
    Usually, for us, that's because we're doing
    like a
  • 12:58 - 13:01
    file upload or another lengthy operation.
    But, again, never
  • 13:01 - 13:04
    assume that things are gonna take a normal
    amount
  • 13:04 - 13:07
    of time.
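
    (For reference, a minimal sketch of this pattern with Capybara's
    RSpec matchers; the content strings and the 20-second wait are
    illustrative:)

      # Bolt it down: assert the markup exists before interacting with
      # it. Capybara retries have_content until it appears or the wait
      # expires.
      expect(page).to have_content("Profile saved")

      # For a lengthy operation like a file upload, raise the wait for
      # this one assertion instead of sleeping.
      expect(page).to have_content("Upload complete", wait: 20)
      click_link "Continue"
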
  • 13:07 - 13:10
    Race conditions. I would be out of line to
  • 13:10 - 13:13
    give this talk without talking explicitly
    about race conditions,
  • 13:13 - 13:16
    right. Whenever you create a situation
    where a
  • 13:16 - 13:20
    sequence of key events doesn't happen in a
    predetermined
  • 13:20 - 13:25
    order, you've got a potential race condition.
  • 13:25 - 13:28
    So the winner of the race is random. And
  • 13:28 - 13:32
    that can create random outcomes in your test
    suite.
  • 13:32 - 13:36
    So what's an example of one of those? AJAX.
  • 13:36 - 13:43
    Right? In AJAX, your JavaScript running in
    Firefox may
  • 13:43 - 13:46
    or may not complete its AJAX call and render
  • 13:46 - 13:50
    the response before the test thread makes
    its assertions.
  • 13:50 - 13:54
    Now, Capybara tries to fix this by retrying
    the
  • 13:54 - 13:58
    assertions. But that doesn't always work.
    So, say you're
  • 13:58 - 14:01
    clicking a button to submit a form, and then
  • 14:01 - 14:05
    you're going to another page or refreshing
    the page.
  • 14:05 - 14:08
    This might cut off that post request, whether
    it's
  • 14:08 - 14:10
    from a regular form or an AJAX form, but
  • 14:10 - 14:13
    especially if it's an AJAX request. As soon
    as
  • 14:13 - 14:16
    you say, visit, all the outstanding AJAX requests
    cancel
  • 14:16 - 14:18
    in your browser.
  • 14:18 - 14:21
    So, you can fix this by adding an explicit
  • 14:21 - 14:24
    wait into your Cucumber step, right. When
    you need
  • 14:24 - 14:27
    to rig the race, jQuery provides this handy
    counter,
  • 14:27 - 14:31
    dollar dot active. That's all the XHR requests
    that
  • 14:31 - 14:33
    are outstanding. So, it's really not hard
    to keep
  • 14:33 - 14:37
    an eye on what's going on.
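
    (A sketch of that kind of explicit wait as a Cucumber step, assuming
    jQuery is on the page; the step name and polling interval are
    illustrative:)

      # features/step_definitions/ajax_steps.rb
      require "timeout"

      # Rig the race: block until jQuery reports no outstanding XHR
      # requests, or give up after Capybara's wait time.
      When(/^I wait for AJAX to finish$/) do
        Timeout.timeout(Capybara.default_wait_time) do
          sleep 0.1 until page.evaluate_script("jQuery.active").zero?
        end
      end
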
  • 14:37 - 14:41
    Here's another offender. Creating database
    objects from within the
  • 14:41 - 14:45
    test thread, right. What's wrong with this
    approach? Now,
  • 14:45 - 14:49
    if you're using MySQL, maybe nothing's wrong
    with this,
  • 14:49 - 14:54
    right. And that's because MySQL has the transaction
    hygiene
  • 14:54 - 15:00
    of a roadside diner, right. There's no separation.
    If
  • 15:00 - 15:02
    you're using Postgres, which we are, it has
    stricter
  • 15:02 - 15:05
    rules about the transactions. And this can
    create a
  • 15:05 - 15:07
    world of pain.
  • 15:07 - 15:10
    So, the test code and the Rails server are
  • 15:10 - 15:13
    running in different threads. And this effectively
    means different
  • 15:13 - 15:17
    database connections, and that means different
    transaction states. Now
  • 15:17 - 15:21
    there is some shared database connection code
    out there.
  • 15:21 - 15:24
    And I've had sort of mixed results with it.
  • 15:24 - 15:27
    I've heard this thing, right, about shared
    mutable resources
  • 15:27 - 15:32
    between threads being problematic. Like, they
    are. So let's
  • 15:32 - 15:35
    say you're lucky, and both threads are in
    the
  • 15:35 - 15:39
    same database transaction. Both the test thread
    and the
  • 15:39 - 15:43
    server thread are issuing check points and
    rollbacks against
  • 15:43 - 15:48
    the same connection. So sometimes one thread
    will reset
  • 15:48 - 15:50
    to a checkpoint after the other thread has
    already
  • 15:50 - 15:54
    rolled back the entire transaction. Right?
    And that's how
  • 15:54 - 15:56
    you get a rando.
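
    (The shared database connection code he's referring to is usually a
    monkeypatch along these lines, shown here only to make the shared
    state concrete, not as a recommendation:)

      # Force every thread to reuse one ActiveRecord connection, so the
      # test thread and the server thread share a transaction. This is
      # the connection both threads issue checkpoints and rollbacks on.
      class ActiveRecord::Base
        mattr_accessor :shared_connection
        @@shared_connection = nil

        def self.connection
          @@shared_connection || retrieve_connection
        end
      end
      ActiveRecord::Base.shared_connection = ActiveRecord::Base.connection
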
  • 15:56 - 15:59
    So, you want to create some state within your
  • 15:59 - 16:01
    application to run your test against, but
    you can't
  • 16:01 - 16:04
    trust the test thread and the server thread
    to
  • 16:04 - 16:07
    read the same database state, right. What
    do you
  • 16:07 - 16:10
    do?
  • 16:10 - 16:12
    So in our project, we use a single set
  • 16:12 - 16:14
    of fixture data that's fixed at the beginning
    of
  • 16:14 - 16:17
    the test run. And, essentially, the server
    thread, or
  • 16:17 - 16:20
    the test thread, sorry, treats the database
    as immutable.
  • 16:20 - 16:23
    It is read only, and any kind of verification
  • 16:23 - 16:26
    of changes has to happen via the browser.
  • 16:26 - 16:29
    So, we do this using RyanD's fixture_builder
    gem, to
  • 16:29 - 16:34
    combine the maintainable characteristics of
    factoried objects with the,
  • 16:34 - 16:38
    like, set it and forget it simplicity of fixtures.
  • 16:38 - 16:41
    So, any state that needs to exist across multiple
  • 16:41 - 16:43
    tests is stored in a set of fixtures, and
  • 16:43 - 16:45
    those are used throughout the test suite.
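
    (A minimal fixture_builder setup, roughly what that looks like; the
    factory and fixture names are illustrative:)

      # spec/support/fixture_builder.rb
      FixtureBuilder.configure do |fbuilder|
        # Rebuild the fixture set when the factories or this file change.
        fbuilder.files_to_check += Dir["spec/factories/*.rb",
                                       "spec/support/fixture_builder.rb"]

        fbuilder.factory do
          # Build the shared, read-only state once; tests refer to the
          # generated fixtures by name.
          user = FactoryGirl.create(:user, first_name: "Pat", state: "CA")
          fbuilder.name(:default_user, user)
        end
      end
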
  • 16:45 - 16:51
    And this is great, except it's also terrible.
    Unfortunately,
  • 16:51 - 16:56
    our fixture_builder definition file is like
    900 lines long.
  • 16:56 - 16:59
    And it's as dense as a Master's thesis, right.
  • 16:59 - 17:02
    It takes about two minutes to rebuild the
    fixture
  • 17:02 - 17:03
    set. And this happens when we rebundle, change
    the
  • 17:03 - 17:07
    factories, change the schema. Fortunately,
    that only happens a
  • 17:07 - 17:10
    couple of times a day, right. So mostly we're
  • 17:10 - 17:12
    saving time with it. But seriously? Two minutes
    as
  • 17:12 - 17:15
    your overhead to run one test is brutal.
  • 17:15 - 17:17
    So, at our stage, we think the right solution
  • 17:17 - 17:21
    is to use fixture_builder sparingly, right.
    Use it for
  • 17:21 - 17:24
    Cucumber tests, because they need an immutable
    database. And
  • 17:24 - 17:28
    maybe use it for core shared models for Rspec,
  • 17:28 - 17:30
    but whatever you do, do not create like a
  • 17:30 - 17:34
    DC Comics multiverse in your fixture setup
    file, with
  • 17:34 - 17:37
    like different versions for everything, because
    that leads to
  • 17:37 - 17:38
    pain.
  • 17:38 - 17:41
    Another thing you want to do is Mutex it.
  • 17:41 - 17:44
    So, a key technique we've used to prevent
    database
  • 17:44 - 17:47
    collisions is to put a Mutex on access to
  • 17:47 - 17:51
    the database. And this is crazy, but, you
    know,
  • 17:51 - 17:54
    an app running in the browser can make more
  • 17:54 - 17:56
    than one connection to the server at once
    over
  • 17:56 - 17:59
    AJAX. And that's a great place to breed race
  • 17:59 - 18:00
    conditions.
  • 18:00 - 18:04
    So, unless you have a Mutex, to ensure the
  • 18:04 - 18:07
    server only responds to one request at a time,
  • 18:07 - 18:09
    you don't necessarily know the order in which
    things
  • 18:09 - 18:10
    are gonna happen, and that means you're gonna
    get
  • 18:10 - 18:11
    unreproducible failures.
  • 18:11 - 18:14
    In effect, we use a Mutex to rig, rig
  • 18:14 - 18:16
    the race. You can check it out on GitHub.
  • 18:16 - 18:18
    It's just a sketch of the code we're using.
  • 18:18 - 18:24
    It's on omadahealth slash capybara_sync.
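
    (The idea, reduced to a sketch: a Rack middleware that holds one lock
    per request in the test environment. This is not the actual
    capybara_sync code, just the shape of it:)

      # Serve one request at a time under test, so overlapping AJAX
      # calls can't interleave their database work.
      class SynchronizedServer
        MUTEX = Mutex.new

        def initialize(app)
          @app = app
        end

        def call(env)
          MUTEX.synchronize { @app.call(env) }
        end
      end

      # config/environments/test.rb (hypothetical wiring)
      # config.middleware.insert_before 0, SynchronizedServer
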
  • 18:24 - 18:31
    Faker. Some of the randomness in our test
    suite
  • 18:31 - 18:33
    was due to inputs that we gave it. Our
  • 18:33 - 18:38
    code depends on factories. And the factories
    used randomly
  • 18:38 - 18:42
    generated fake data to fill in names, zip
    codes,
  • 18:42 - 18:44
    all the text fields. And there are good reasons
  • 18:44 - 18:45
    to use random data.
  • 18:45 - 18:49
    It regularly exercises your edge cases. Engineers
    don't have
  • 18:49 - 18:52
    to think of all possible first names you could
  • 18:52 - 18:54
    use. The code should work the same regardless
    of
  • 18:54 - 18:58
    what zip code someone is in. But sometimes
    it
  • 18:58 - 19:00
    doesn't.
  • 19:00 - 19:04
    For example, did you know that Faker includes
    Guam
  • 19:04 - 19:06
    and Puerto Rico in the states that it might
  • 19:06 - 19:09
    generate for someone? And we didn't include
    those in
  • 19:09 - 19:13
    our states dropdown. So when a Cucumber test
    edits
  • 19:13 - 19:15
    an account for a user that Faker placed in
  • 19:15 - 19:18
    Guam, their state is not entered when
    you
  • 19:18 - 19:20
    try to click save. And that leads to a
  • 19:20 - 19:23
    validation failure, and that leads to Cucumber
    not seeing
  • 19:23 - 19:26
    the expected results, and a test run
    from
  • 19:26 - 19:33
    a new factory will not reproduce that failure,
    right.
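
    (One way to keep the benefits of random data while pinning the state
    field to values the form's dropdown actually offers; not necessarily
    what they did, and the constant and factory names are illustrative:)

      # spec/factories/users.rb
      # Faker still randomizes names and zip codes, but the state is
      # drawn from the list the account form supports.
      SUPPORTED_STATES = %w[AL AK AZ AR CA CO CT].freeze # ...and so on

      FactoryGirl.define do
        factory :user do
          first_name { Faker::Name.first_name }
          zip_code   { Faker::Address.zip_code }
          state      { SUPPORTED_STATES.sample }
        end
      end
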
  • 19:37 - 19:40
    Something crazy happened. Here we go.
  • 19:40 - 19:46
    Times and dates. Oh, we're out of sync. Let
  • 19:46 - 19:53
    me just. Momentary technical difficulties.
    Mhmm.
  • 20:02 - 20:09
    Cool.
  • 20:09 - 20:16
    OK. Times and dates. Another subtle input
    to your
  • 20:19 - 20:24
    code is the current time. Our app sets itself
  • 20:24 - 20:27
    up to be on the user's time zone, to
  • 20:27 - 20:30
    prevent time-dependent data, like which week
    of our program
  • 20:30 - 20:33
    you are on, from changing in the middle of Saturday night.
  • 20:33 - 20:35
    And this was policy. We all knew about this.
  • 20:35 - 20:38
    We always used zone-aware time calls.
  • 20:38 - 20:41
    Except that we didn't. Like, when I audited
    it,
  • 20:41 - 20:43
    I found over a hundred places where we neglected
  • 20:43 - 20:47
    to use zone-aware time calls. So most of these
  • 20:47 - 20:51
    are fine. There's usually nothing wrong with
    epoch seconds.
  • 20:51 - 20:54
    But it only takes one misplaced call to time
  • 20:54 - 20:57
    dot now to create a failure. It's really best
  • 20:57 - 20:59
    to just forget about time dot now. Search
    your
  • 20:59 - 21:02
    code base for it and eliminate it. Always
    use
  • 21:02 - 21:04
    time dot zone dot now. Same thing for date
  • 21:04 - 21:07
    dot today. That's time zone dependent. You
    want to
  • 21:07 - 21:09
    use time dot zone dot today.
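
    (The rule of thumb, in code; the zone name is illustrative:)

      Time.now         # machine-local time zone -- a lurking rando
      Date.today       # also depends on the machine's zone
      Time.zone.now    # respects the application / user time zone
      Time.zone.today  # zone-aware date

      # With the app pinned to the user's zone, the two can disagree:
      Time.zone = "Hawaii"
      Time.zone.today  # may differ from Date.today on a CI box in UTC
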
  • 21:09 - 21:12
    Unsurprisingly, I found a bunch of this class
    of
  • 21:12 - 21:15
    failure when I was at RubyConf in Miami. So
  • 21:15 - 21:19
    these methods create random failures. Because
    your database objects
  • 21:19 - 21:21
    can be in a different time zone than your
  • 21:21 - 21:26
    machine's local time zone.
  • 21:26 - 21:29
    External dependencies. Any time you depend
    on a third
  • 21:29 - 21:32
    party service in your test, you introduce
    a possible
  • 21:32 - 21:37
    random element, right. S3, Google Analytics,
    Facebook. Any of
  • 21:37 - 21:39
    these things can go down. They can be slow.
  • 21:39 - 21:43
    They can be broken. Additionally, they all
    depend on
  • 21:43 - 21:46
    the quality of your local internet connection.
  • 21:46 - 21:50
    So, I'm gonna suggest that if you are affected
  • 21:50 - 21:55
    by random failures, it's important to reproduce
    the failure.
  • 21:55 - 21:56
    It is possible. It is possible. It is not
  • 21:56 - 21:59
    only possible. It is critical. And any problem
    that
  • 21:59 - 22:04
    you can reproduce, reliably, can be solved.
    Well, at
  • 22:04 - 22:06
    least, if you can reproduce it, you have a
  • 22:06 - 22:10
    heck of a lot better chance of solving it.
  • 22:10 - 22:14
    So, you have to bolt it all down. How
  • 22:14 - 22:17
    do you fix the data? When you're trying to
  • 22:17 - 22:19
    reproduce a random failure, you're gonna need
    the same
  • 22:19 - 22:22
    database objects used by the failing test.
    So if
  • 22:22 - 22:24
    you used factories, and there's not a file
    system
  • 22:24 - 22:27
    record when a test starts to fail randomly,
    you're
  • 22:27 - 22:29
    gonna want to document the database state
    at the
  • 22:29 - 22:30
    time of failure.
  • 22:30 - 22:33
    And that's gonna mean yml fixtures or, like,
    an
  • 22:33 - 22:36
    SQL dump, or something else clever. You have
    to
  • 22:36 - 22:38
    find a way to re-establish that same state
    that
  • 22:38 - 22:42
    was created at the moment that you had the
  • 22:42 - 22:45
    failure. And the network. Great talk before
    about how
  • 22:45 - 22:48
    to nail down the network. API calls and responses
  • 22:48 - 22:52
    are input for your code. WebMock, VCR, other
    libraries
  • 22:52 - 22:56
    exist to replay third party service responses.
  • 22:56 - 22:58
    So, if you're trying to reproduce a failure
    in
  • 22:58 - 23:01
    a test that has any third party dependencies,
    you're
  • 23:01 - 23:03
    gonna wanna use a library to capture and replay
  • 23:03 - 23:06
    those responses.
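
    (A minimal capture-and-replay setup with VCR backed by WebMock;
    paths and cassette tags are illustrative:)

      # spec/support/vcr.rb
      require "vcr"

      VCR.configure do |c|
        c.cassette_library_dir = "spec/cassettes"
        c.hook_into :webmock
        c.configure_rspec_metadata!  # lets you tag examples with :vcr
      end

      # The first run records the third-party response; later runs
      # replay it, so the test no longer depends on the network.
      describe "S3 uploads", :vcr do
        it "stores the document" do
          # ...
        end
      end
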
  • 23:06 - 23:10
    Also, share buttons, right. In your Cucumber
    tests, you're
  • 23:10 - 23:13
    gonna wanna remove the calls to Google Analytics,
    Facebook
  • 23:13 - 23:16
    like buttons, all that stuff from the browser.
    These
  • 23:16 - 23:18
    slow down your page load time, and they create
  • 23:18 - 23:22
    unnecessary failures because of that.
  • 23:22 - 23:24
    But, if you're replaying all your network
    calls, how
  • 23:24 - 23:28
    do you know the external API hasn't changed,
    right?
  • 23:28 - 23:29
    You want to test the services that your code
  • 23:29 - 23:31
    depends on, too. So you need a build that
  • 23:31 - 23:35
    does that. But it shouldn't be the main build.
  • 23:35 - 23:36
    Purpose of the main build is to let the
  • 23:36 - 23:40
    team know when their code is broken, and only when
    their
  • 23:40 - 23:43
    code is broken. And it should do that as
  • 23:43 - 23:45
    quickly as possible.
  • 23:45 - 23:47
    And then we have a separate, external build
    that
  • 23:47 - 23:51
    tests the interactions with third party services.
    So, essentially,
  • 23:51 - 23:53
    external communication is off and then on,
    and we
  • 23:53 - 23:54
    check build results for both.
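
    (One way to flip that switch between the two builds; the environment
    variable name is illustrative:)

      # spec/support/webmock.rb
      require "webmock/rspec"

      if ENV["EXTERNAL_BUILD"]
        # The separate build: really talk to S3, Facebook, and friends.
        WebMock.allow_net_connect!
      else
        # The main build: fail fast on any unexpected outside call.
        WebMock.disable_net_connect!(allow_localhost: true)
      end
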
  • 23:54 - 23:58
    So, I want to talk about another reason that
  • 23:58 - 24:01
    tests fail randomly. Rspec runs all your tests
    in
  • 24:01 - 24:06
    a random order every time. And obviously this
    introduces
  • 24:06 - 24:09
    randomness. But, there is a reason for that,
    and
  • 24:09 - 24:11
    the reason is to help you stay on top
  • 24:11 - 24:14
    of test pollution.
  • 24:14 - 24:16
    Test pollution is when state that is changed
    in
  • 24:16 - 24:20
    one test persists and influences the results
    of other
  • 24:20 - 24:23
    tests. Changed state can live in process memory,
    in
  • 24:23 - 24:26
    a database, on the file system, in an external
  • 24:26 - 24:29
    service. Right. Lots of places.
  • 24:29 - 24:33
    Sometimes, the polluted state causes the subsequent
    test to
  • 24:33 - 24:37
    fail incorrectly. And sometimes it causes
    the subsequent test
  • 24:37 - 24:41
    to pass incorrectly. And this was such a rampant
  • 24:41 - 24:44
    issue in the early days of Rspec that the
  • 24:44 - 24:46
    Rspec team made running the tests in a random
  • 24:46 - 24:50
    order the default as of Rspec 2. So, thank
  • 24:50 - 24:51
    you Rspec.
  • 24:51 - 24:56
    Now, any test pollution issues should stand
    out. But
  • 24:56 - 24:59
    what do you think happens if you ignore random
  • 24:59 - 25:05
    test failures for like a year or so? Yeah.
  • 25:05 - 25:08
    Here's some clues that your issue might be
    test
  • 25:08 - 25:10
    pollution, right.
  • 25:10 - 25:14
    With test pollution, the affected tests never
    fail when
  • 25:14 - 25:18
    they're run in isolation. Not ever. And rather
    than
  • 25:18 - 25:22
    throwing an unexpected exception, a test pollution
    failure usually
  • 25:22 - 25:25
    takes the form of returning different data
    than what
  • 25:25 - 25:26
    you expected.
  • 25:26 - 25:29
    And finally, the biggest clue that you might
    have
  • 25:29 - 25:32
    a test pollution issue is that you haven't
    really
  • 25:32 - 25:37
    been checking for test pollution. So, we gotta
    reproduce
  • 25:37 - 25:39
    test pollution issues. Which means we have
    to run
  • 25:39 - 25:41
    the tests suite in the same order, and we
  • 25:41 - 25:44
    have to use the fixture or database data and
  • 25:44 - 25:47
    the network data from the failed build.
  • 25:47 - 25:50
    So, first you have to identify the random
    seed.
  • 25:50 - 25:52
    Maybe you've seen this cryptic line at the
    end
  • 25:52 - 25:56
    of your Rspec test output. This is not completely
  • 25:56 - 26:00
    meaningless. 22164 is your magic key to rerun
    the
  • 26:00 - 26:02
    test in the same order as the build that
  • 26:02 - 26:05
    just ran. So you want to modify your dot
  • 26:05 - 26:09
    Rspec file to include the seed value. Be sure
  • 26:09 - 26:12
    to change the format to documentation
    as
  • 26:12 - 26:13
    well as adding the seed. That will make it
  • 26:13 - 26:16
    more readable, for you, so that you can start
  • 26:16 - 26:17
    to think about the order that things are running
  • 26:17 - 26:20
    in and what could possibly be causing your
    pollution
  • 26:20 - 26:21
    problem.
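
    (With the seed from the output above, the contents of the .rspec
    file end up looking like this:)

      --seed 22164
      --format documentation
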
  • 26:21 - 26:25
    So, the problem with test pollution is fundamentally
    about
  • 26:25 - 26:28
    incorrectly persisted state, so that the state
    that's being
  • 26:28 - 26:32
    persisted is important. You want to ensure
    that the
  • 26:32 - 26:35
    data is identical to the failed build. And
    there's
  • 26:35 - 26:39
    lots of ways to do this.
  • 26:39 - 26:42
    So you've got your random seed. You've got
    your
  • 26:42 - 26:44
    data from the failed build, and then you rerun
  • 26:44 - 26:47
    the specs. And if you see the failure repeated,
  • 26:47 - 26:50
    you should celebrate, right. You've correctly
    diagnosed that the
  • 26:50 - 26:52
    issue is test pollution and you are on your
  • 26:52 - 26:53
    way to fixing it.
  • 26:53 - 26:56
    And if you don't see the failure, maybe it's
  • 26:56 - 26:59
    not test pollution. Maybe there's another
    aspect of your
  • 26:59 - 27:03
    build environment that needs to be duplicated,
    right. But
  • 27:03 - 27:08
    even then, say you've reproduced the problem.
    Now what?
  • 27:08 - 27:11
    You still have to diagnose what is causing
    the
  • 27:11 - 27:13
    pollution. You know that running the tests
    in a
  • 27:13 - 27:17
    particular order creates a failure. The problem
    with test
  • 27:17 - 27:20
    pollution is that there is a non-obvious connection
    between
  • 27:20 - 27:23
    where the problem appears in the failed test
    and
  • 27:23 - 27:26
    its source in another test case.
  • 27:26 - 27:28
    And you can find out about the failure using
  • 27:28 - 27:32
    print statements or debugger, using whatever
    tools you want.
  • 27:32 - 27:35
    But, maybe you get lucky and you are able
  • 27:35 - 27:37
    to just figure it out. But in a complex
  • 27:37 - 27:41
    code base with thousands of tests the source
    of
  • 27:41 - 27:45
    the pollution can be tricky to track down.
  • 27:45 - 27:49
    So, just running through the suite to reproduce
    the
  • 27:49 - 27:52
    failure might take ten minutes. And this is
    actually
  • 27:52 - 27:56
    terrible, right. Waiting ten minutes for feedback?
    This is
  • 27:56 - 28:01
    a source of cognitive depletion. All of the
    stack
  • 28:01 - 28:03
    you've built up in your brain to solve this
  • 28:03 - 28:06
    problem is disintegrating over that ten minutes.
    You're gonna
  • 28:06 - 28:09
    work on other problems. You're gonna check
    Facebook while
  • 28:09 - 28:11
    those tests are running. And you're gonna
    lose your
  • 28:11 - 28:16
    focus, right. And that is, essentially, how
    rando wins.
  • 28:16 - 28:21
    Fortunately, we can discard large amounts
    of complexity and
  • 28:21 - 28:23
    noise, by using a stupid process that we don't
  • 28:23 - 28:29
    have to think about. Binary search. In code,
    debugging
  • 28:29 - 28:32
    via binary search is a process of repeatedly
    dividing
  • 28:32 - 28:36
    the search space in half, until you locate
    the
  • 28:36 - 28:41
    smallest coherent unit that exhibits the desired
    behavior.
  • 28:41 - 28:45
    OK. So we have the output of a set
  • 28:45 - 28:48
    of specs that we ran in documentation mode.
    This
  • 28:48 - 28:50
    is sort of a high level overview that you
  • 28:50 - 28:52
    might see in Sublime, right. And in the middle
  • 28:52 - 28:56
    here, this red spot is where the failure occurs.
  • 28:56 - 29:00
    So we know the cause has to happen before
  • 29:00 - 29:05
    the failure, because causality. So in the
    green block,
  • 29:05 - 29:09
    at the top, is, that's the candidate block,
    or
  • 29:09 - 29:11
    the search space.
  • 29:11 - 29:17
    So, practically, we split the search space
    in half,
  • 29:17 - 29:20
    and remove half of it. And if the failure
  • 29:20 - 29:23
    reoccurs when we rerun with this configuration,
    we know
  • 29:23 - 29:28
    that the cause is in that remaining block,
    right.
  • 29:28 - 29:31
    But sometimes you've got more problems than
    you know.
  • 29:31 - 29:32
    So it's good to test the other half of
  • 29:32 - 29:36
    the search space as well.
  • 29:36 - 29:38
    So if your failure appeared in step zero, you
    you
  • 29:38 - 29:41
    expect not to see the failure here. If you
  • 29:41 - 29:45
    also see the failure here, you might have
    multiple
  • 29:45 - 29:49
    sources of test pollution or, more likely,
    test pollution
  • 29:49 - 29:51
    isn't really your problem, and the problem
    is actually
  • 29:51 - 29:55
    outside of the search space.
  • 29:55 - 29:59
    So here's a hiccup. Binary search requires
    us to
  • 29:59 - 30:01
    remove large segments of the test suite to
    narrow
  • 30:01 - 30:04
    in on the test that causes the pollution.
    And
  • 30:04 - 30:07
    this creates a problem, because random ordering
    in the
  • 30:07 - 30:11
    test suite changes when you remove tests.
    Completely. Remove
  • 30:11 - 30:14
    one test, the whole thing reshuffles on the
    same
  • 30:14 - 30:18
    seed. So there's no way to effectively perform
    a
  • 30:18 - 30:21
    binary search using a random seed.
  • 30:21 - 30:24
    So here's the good news. It is possible to
  • 30:24 - 30:27
    manually declare the ordering of your Rspec
    tests, using
  • 30:27 - 30:34
    this undocumented configuration option, order_examples.
    So, config dot order_examples
  • 30:35 - 30:38
    takes a block, and that'll get the whole collection
  • 30:38 - 30:41
    of Rspec examples after Rspec has loaded the
    specs
  • 30:41 - 30:43
    to be run. And then you just reorder the
  • 30:43 - 30:45
    examples in whatever order you want them to
    be
  • 30:45 - 30:49
    ordered in and return that set from the block.
  • 30:49 - 30:51
    So, that sounds simple.
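
    (A sketch of that, assuming a file with one full example description
    per line recorded from the failed build; the file name and matching
    scheme are illustrative, and this is the RSpec 2-era undocumented
    API the talk describes:)

      # spec/spec_helper.rb
      desired_order = File.readlines("tmp/replay_order.txt").map(&:strip)

      RSpec.configure do |config|
        config.order_examples do |examples|
          # Put the loaded examples back in the recorded order; anything
          # not listed sorts to the end.
          examples.sort_by do |example|
            desired_order.index(example.full_description) ||
              desired_order.length
          end
        end
      end
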
  • 30:51 - 30:57
    I, I made a little proto-gem for this. It's
  • 30:57 - 31:01
    called rspec_manual_order, and basically it
    takes the output of
  • 31:01 - 31:04
    the documentation format from the test that
    you ran
  • 31:04 - 31:09
    earlier, and turns that into an ordering list.
    So,
  • 31:09 - 31:11
    if you, if you log the output of, of
  • 31:11 - 31:14
    your Rspec suite with the failure to a file,
  • 31:14 - 31:16
    you'll be able to replay it using rspec_manual_order,
    and
  • 31:16 - 31:19
    you can check that out on GitHub.
  • 31:19 - 31:25
    So it's possible to reduce the search space
    and
  • 31:25 - 31:27
    do a binary search on Rspec. And once you've
  • 31:27 - 31:30
    reduced the search space to a single spec
    or
  • 31:30 - 31:32
    a suite of examples that all cause the problem,
  • 31:32 - 31:36
    you put your monkey in the position to shine
  • 31:36 - 31:38
    against your test pollution issue, right.
    This is where
  • 31:38 - 31:41
    it actually becomes possible to figure it
    out by
  • 31:41 - 31:44
    looking at the context.
  • 31:44 - 31:46
    I've gone in depth into test pollution, because
    it's
  • 31:46 - 31:51
    amenable to investigation using simple techniques,
    right. Binary search
  • 31:51 - 31:54
    and reproducing the failure state are key
    debugging skills
  • 31:54 - 31:57
    that you will improve with practice. When
    I started
  • 31:57 - 31:59
    looking into our random failures, I didn't
    know we
  • 31:59 - 32:03
    had test pollution issues. Turned out we weren't
    resetting
  • 32:03 - 32:06
    the global time zone correctly between tests.
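
    (That particular fix amounts to restoring the zone around every
    example; a minimal guard, with the hook placement illustrative:)

      # spec/spec_helper.rb
      RSpec.configure do |config|
        config.around(:each) do |example|
          original_zone = Time.zone
          begin
            example.run
          ensure
            # Undo any Time.zone= the example or app code performed.
            Time.zone = original_zone
          end
        end
      end
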
  • 32:06 - 32:08
    This was far from the only problem I found.
  • 32:08 - 32:11
    But without fixing this one, our suite would
    never
  • 32:11 - 32:16
    be clean. So, every random failure that you
    are
  • 32:16 - 32:20
    chasing has its own unique story. There are
    some
  • 32:20 - 32:22
    in our code that we haven't figured out yet,
  • 32:22 - 32:24
    and there are some in your code that I
  • 32:24 - 32:27
    hope I never see.
  • 32:27 - 32:31
    The key to eliminating random test failures
    is don't
  • 32:31 - 32:34
    give up, right. Today we've covered things
    that go
  • 32:34 - 32:37
    wrong in Cucumber and Capybara. Things that
    go wrong
  • 32:37 - 32:40
    in Rspec and just general sources of randomness
    in
  • 32:40 - 32:42
    your test suite. And hopefully you're walking
    out of
  • 32:42 - 32:44
    here with at least one new technique to improve
  • 32:44 - 32:46
    the reliability of your tests.
  • 32:46 - 32:48
    We've been working with ours for about eight
    months,
  • 32:48 - 32:50
    and we're in a place where random failures
    occur
  • 32:50 - 32:54
    like, less than five percent of the time.
    And
  • 32:54 - 32:56
    we set up a tiered build system to run
  • 32:56 - 32:59
    the tests sequentially when the fast parallel
    build fails.
  • 32:59 - 33:02
    So, the important thing is that when new random
  • 33:02 - 33:05
    failures occur, we reliably assign a team
    to hunt
  • 33:05 - 33:05
    them down.
  • 33:05 - 33:08
    And if you keep working on your build, eventually
  • 33:08 - 33:10
    you'll figure out a combination of tactics
    that will
  • 33:10 - 33:13
    lead to a stable, reliable test suite, that
    will
  • 33:13 - 33:16
    have the trust of your team. So thank you.