
RailsConf 2014 - Eliminating Inconsistent Test Failures by Austin Putman

  • 0:17 - 0:20
    AUSTIN PUTMAN: This is the last session before
    happy hour.
  • 0:20 - 0:25
    I appreciate all of you for hanging around
  • 0:25 - 0:28
    this long. Maybe you're here because,
  • 0:28 - 0:29
    you don't know, there's a bar on the
  • 0:29 - 0:32
    first floor of this hotel. I think that
  • 0:32 - 0:35
    is where the main track is currently taking
    place.
  • 0:35 - 0:38
    I am Austin Putman. I am the VP of
  • 0:38 - 0:41
    engineering for Omada Health. At Omada, we
    support people
  • 0:41 - 0:45
    at risk of chronic disease, like diabetes,
    make crucial
  • 0:45 - 0:49
    behavior changes and have longer, healthier
    lives. So, it's
  • 0:49 - 0:50
    pretty awesome.
  • 0:50 - 0:53
    I'm gonna start with some spoilers, because
    I want
  • 0:53 - 0:55
    you to have an amazing RailsConf. So if this
  • 0:55 - 0:57
    is not what you're looking for, don't be shy
  • 0:57 - 1:01
    about finding that bar track. We're gonna
    spend some
  • 1:01 - 1:05
    quality time with Capybara and Cucumber, whose
    flakiness is
  • 1:05 - 1:07
    legendary, for very good reasons.
  • 1:07 - 1:11
    Let me take your temperature. Can I see hands?
  • 1:11 - 1:13
    How many people have had problems with random
    failures
  • 1:13 - 1:18
    in Cucumber or Capybara? Yeah. Yeah. This
    is reality,
  • 1:18 - 1:19
    folks.
  • 1:19 - 1:22
    We're also gonna cover the ways that Rspec
    does
  • 1:22 - 1:25
    and does not help us track down test pollution.
  • 1:25 - 1:27
    How many folks out there have had a random
  • 1:27 - 1:29
    failure problem in the Rspec suite, like in
    your
  • 1:29 - 1:32
    models or your controller tests? OK, still
    a lot
  • 1:32 - 1:35
    of people, right. It happens. But we don't
    talk
  • 1:35 - 1:37
    about it.
  • 1:37 - 1:40
    So in between, we're gonna review some problems
    that
  • 1:40 - 1:43
    can dog any test suite. This is like, random
  • 1:43 - 1:47
    data, time zone heck, external dependencies.
    All this leads
  • 1:47 - 1:48
    to pain. There was a great talk before about
  • 1:48 - 1:50
    external dependencies.
  • 1:50 - 1:53
    Just, here's just a random one. How many people
  • 1:53 - 1:54
    here have had a test fail due to a
  • 1:54 - 2:00
    daylight savings time issue? Yeah. Ben Franklin,
    you are
  • 2:00 - 2:02
    a menace.
  • 2:02 - 2:06
    Let's talk about eliminating inconsistent
    failures in your tests,
  • 2:06 - 2:10
    and on our team, we call that fighting randos.
  • 2:10 - 2:12
    And I'm here to talk about this, because I
  • 2:12 - 2:16
    was stupid and short-sighted, and random failures
    caused us
  • 2:16 - 2:18
    a lot of pain. I chose to try to
  • 2:18 - 2:22
    hit deadlines instead of focusing on build
    quality, and
  • 2:22 - 2:24
    our team paid a terrible price.
  • 2:24 - 2:26
    Anybody out there paying that price? Anybody
    out there
  • 2:26 - 2:30
    feel me on this? Yeah. It's, it sucks.
  • 2:30 - 2:32
    So let's do some science. Some projects seem
    to
  • 2:32 - 2:35
    have more random failure problems than others.
    I want
  • 2:35 - 2:37
    to gather some data. So first, if you write
  • 2:37 - 2:41
    tests on a regular basis, raise your hand.
    Right?
  • 2:41 - 2:45
    Wow. I love RailsConf. Keep your hand up if
  • 2:45 - 2:48
    you believe you have experienced a random
    test failure.
  • 2:48 - 2:50
    The whole room.
  • 2:50 - 2:52
    Now, if you think you're likely to have one
  • 2:52 - 2:56
    in the next, like, four weeks. Who's out there?
  • 2:56 - 2:58
    It's still happening, right. You're in the
    middle of
  • 2:58 - 3:01
    it. OK, so this is not hypothetical for this
  • 3:01 - 3:03
    audience. This is a widespread problem. But
    I don't
  • 3:03 - 3:05
    see a lot of people talking about it.
  • 3:05 - 3:08
    And the truth is, while being a great tool,
  • 3:08 - 3:12
    a comprehensive integration suite is like
    a breeding ground
  • 3:12 - 3:14
    for baffling Heisenbugs.
  • 3:14 - 3:18
    So, to understand how test failures become
    a chronic
  • 3:18 - 3:21
    productivity blocker, I want to talk a little
    bit
  • 3:21 - 3:25
    about testing culture, right. Why is this
    even bad?
  • 3:25 - 3:28
    So, we have an automated CI machine that runs
  • 3:28 - 3:31
    our full test suite every time a commit is
  • 3:31 - 3:34
    pushed. And every time the build passes, we
    push
  • 3:34 - 3:37
    the new code to a staging environment for
    acceptance.
  • 3:37 - 3:39
    Right, that's our process. How many people
    out there
  • 3:39 - 3:41
    have a setup that's kind of like that? OK.
  • 3:41 - 3:43
    Awesome. So a lot of people know what I'm
  • 3:43 - 3:44
    talking about.
  • 3:44 - 3:47
    So, in the fall of 2012, we started seeing
  • 3:47 - 3:52
    occasional, unreproducible failures of the
    test suite in Jenkins.
  • 3:52 - 3:54
    And we were pushing to get features out the
  • 3:54 - 3:59
    door for January first. And we found that
    we
  • 3:59 - 4:01
    could just rerun the build and the failure
    would
  • 4:01 - 4:03
    go away.
  • 4:03 - 4:06
    And we got pretty good at spotting the two
  • 4:06 - 4:09
    or three tests where this happened. So, we
    would
  • 4:09 - 4:12
    check the output of a failed build, and if
  • 4:12 - 4:15
    it was one of the suspect tests, we would
  • 4:15 - 4:18
    just run the build again. Not a problem. Staging
  • 4:18 - 4:21
    would deploy. We would continue our march
    towards the
  • 4:21 - 4:23
    launch.
  • 4:23 - 4:24
    But by the time spring rolled around, there
    were
  • 4:24 - 4:28
    like seven or eight places causing problems
    regularly. And
  • 4:28 - 4:30
    we would try to fix them, you know, we
  • 4:30 - 4:33
    wouldn't ignore them. But the failures were
    unreliable. So
  • 4:33 - 4:35
    it was hard to say if we had actually
  • 4:35 - 4:37
    fixed anything.
  • 4:37 - 4:39
    And eventually we just added a gem called
    Cucumber-Rerun.
  • 4:39 - 4:46
    Yeah. And this just reruns the failed specs
    if
  • 4:46 - 4:49
    there's a problem. And when it passed the
    second
  • 4:49 - 4:54
    time, it's good. You're fine. No big deal.
  • 4:54 - 4:56
    And then some people on our team got ambitious,
  • 4:56 - 4:58
    and they said, we could make it faster. We
  • 4:58 - 5:02
    could make CI faster with the parallel_tests
    gem, which
  • 5:02 - 5:07
    is awesome. But Cucumber-Rerun and parallel_tests
    are not compatible.
  • 5:07 - 5:09
    And so we had a test suite that ran
  • 5:09 - 5:13
    three times faster, but failed twice as often.
  • 5:13 - 5:16
    And as we came into the fall, we had
  • 5:16 - 5:21
    our first Bad Jenkins week. On a fateful Tuesday,
  • 5:21 - 5:24
    4 PM, the build just stopped passing. And
    there
  • 5:24 - 5:27
    were anywhere from like thirty to seventy
    failures. And
  • 5:27 - 5:30
    some of them were our usual suspects, and
    dozens
  • 5:30 - 5:34
    of them were, like, previously good tests.
    Tests we
  • 5:34 - 5:35
    trusted.
  • 5:35 - 5:37
    And, so none of them failed in isolation,
    right.
  • 5:37 - 5:40
    And after like two days of working on this,
  • 5:40 - 5:43
    we eventually got a clean Rspec build, but
    Cucumber
  • 5:43 - 5:45
    would still fail. And the failures could not
    be
  • 5:45 - 5:48
    reproduced on a dev machine, or even on the
  • 5:48 - 5:51
    same CI machine, outside of the, the whole
    build
  • 5:51 - 5:52
    running.
  • 5:52 - 5:55
    So, over the weekend, somebody pushes a commit
    and
  • 5:55 - 5:59
    we get a green build. And there's nothing
    special
  • 5:59 - 6:01
    about this commit, right. Like, it was like,
    a
  • 6:01 - 6:05
    comment change. And we had tried a million
    things,
  • 6:05 - 6:08
    and no single change obviously led to the
    passing
  • 6:08 - 6:09
    build.
  • 6:09 - 6:11
    And the next week, we were back to like,
  • 6:11 - 6:15
    you know, fifteen percent failure rate. Like,
    pretty good.
  • 6:15 - 6:19
    So, we could push stories to staging again,
    and
  • 6:19 - 6:23
    we're still under the deadline pressure, right.
    So, so
  • 6:23 - 6:28
    we shrugged. And we moved on, right. And maybe
  • 6:28 - 6:32
    somebody wants to guess, what happened next?
    Right?
  • 6:32 - 6:35
    Yeah. It happened again, right. A whole week
    of
  • 6:35 - 6:38
    just no tests pass. The build never passes.
    So
  • 6:38 - 6:40
    we turned off parallel_tests, right. Because
    we can't even
  • 6:40 - 6:42
    get like a coherent log of which tests are
  • 6:42 - 6:46
    causing errors, and then we started commenting
    out the
  • 6:46 - 6:50
    really problematic tests, and there were still
    these like
  • 6:50 - 6:55
    seemingly innocuous specs that failed regularly
    but not consistently.
  • 6:55 - 6:57
    So these are tests that have enough business
    value
  • 6:57 - 7:01
    that we are very reluctant to just, like,
    delete
  • 7:01 - 7:01
    them.
  • 7:01 - 7:08
    And so we reinstated Cucumber-Rerun, and its
    buddy Rspec-Rerun.
  • 7:08 - 7:13
    And this mostly worked, right. So we were
    making
  • 7:13 - 7:15
    progress. But the build issues continued to
    show up
  • 7:15 - 7:19
    in the negative column in our retrospectives.
    And that
  • 7:19 - 7:21
    was because there were several problems with
    this situation,
  • 7:21 - 7:26
    right. Like, reduced trust. When build failures
    happen four
  • 7:26 - 7:28
    or five times a day, those aren't a red
  • 7:28 - 7:30
    flag. Those are just how things go. And everyone
  • 7:30 - 7:33
    on the team knows that the most likely explanation
  • 7:33 - 7:36
    is a random failure.
  • 7:36 - 7:38
    And the default response to a build failure
    becomes,
  • 7:38 - 7:42
    run it again. So, just run it again, right.
  • 7:42 - 7:48
    The build failed. Whatever. So then, occasionally,
    we break
  • 7:48 - 7:50
    things for real. But we stopped noticing because
    we
  • 7:50 - 7:54
    started expecting CI to be broken. Sometimes
    other pairs
  • 7:54 - 7:56
    would pull the code and they would see the
  • 7:56 - 8:00
    legitimate failures. Sometimes we thought
    we were having a
  • 8:00 - 8:02
    Bad Jenkins week, and on the third or fourth
  • 8:02 - 8:06
    day we realized we were having actual failures.
  • 8:06 - 8:08
    This is pretty bad, right.
  • 8:08 - 8:12
    So our system depends on green builds to mark
  • 8:12 - 8:14
    the code that can be deployed to staging and
  • 8:14 - 8:17
    production, and without green builds, stories
    can't get delivered
  • 8:17 - 8:22
    and reviewed. So we stopped getting timely
    feedback. Meanwhile,
  • 8:22 - 8:25
    the reviewer gets, like, a week's worth of
    stories.
  • 8:25 - 8:26
    All at once. Big clump.
  • 8:26 - 8:28
    And that means they have less time to pay
  • 8:28 - 8:32
    attention to detail on each delivered feature.
    And that
  • 8:32 - 8:35
    means that the product is a little bit crappier
  • 8:35 - 8:40
    every week. So, maybe you need a bug fix.
  • 8:40 - 8:43
    Fast. Forget about that. You've got, like,
    a twenty
  • 8:43 - 8:44
    percent chance your bug fix build is gonna
    fail
  • 8:44 - 8:46
    for no reason.
  • 8:46 - 8:48
    Maybe the code has to ship, because the app
  • 8:48 - 8:52
    is mega busted. In this case, we would rerun
  • 8:52 - 8:54
    the failed tests on our local machine, and
    then
  • 8:54 - 8:59
    cross our fingers and deploy. So, in effect,
    our
  • 8:59 - 9:03
    policy was, if the code works on my machine,
  • 9:03 - 9:07
    it can be deployed to production.
  • 9:07 - 9:11
    So. At the most extreme, people lose faith
    in
  • 9:11 - 9:14
    the build, and eventually they just forget
    about testing.
  • 9:14 - 9:16
    And this didn't happen to us, but I had
  • 9:16 - 9:18
    to explain to management that key features
    couldn't be
  • 9:18 - 9:22
    shipped because of problems with the test
    server. And
  • 9:22 - 9:24
    they wanted to know a lot more about the
  • 9:24 - 9:27
    test server. And it was totally clear that
    while
  • 9:27 - 9:30
    a working test server has their full support,
    an
  • 9:30 - 9:33
    unreliable test server is a business liability
    and needs
  • 9:33 - 9:35
    to be resolved.
  • 9:35 - 9:38
    So, the test server is supposed to solve problems
  • 9:38 - 9:40
    for us, and that is the only story that
  • 9:40 - 9:43
    I like to tell about it. So, we began
  • 9:43 - 9:48
    to fight back. And we personified the random
    failures.
  • 9:48 - 9:53
    They became randos. A rando attack. A rando
    storm.
  • 9:53 - 10:00
    And most memorably, Rando Backstabbian. Intergalactic
    randomness villain.
  • 10:02 - 10:03
    We had a pair working on the test suite
  • 10:03 - 10:05
    full time for about three months trying to
    resolve
  • 10:05 - 10:08
    these issues. We tried about a thousand things,
    and
  • 10:08 - 10:10
    some of them worked. And I'm gonna pass along
  • 10:10 - 10:12
    the answers we found, and my hypothesis that
    we
  • 10:12 - 10:17
    didn't disprove. Honestly, I'm hoping that
    you came to
  • 10:17 - 10:19
    this talk because you've had similar problems
    and you
  • 10:19 - 10:23
    found better solutions. So, this is just what
    we
  • 10:23 - 10:24
    found.
  • 10:24 - 10:26
    I, I have a very important tool for this
  • 10:26 - 10:29
    section of the talk. It's the finger of blame.
  • 10:29 - 10:31
    We use this a lot when we were like,
  • 10:31 - 10:35
    hey, could the problem be Cucumber? And then
    we
  • 10:35 - 10:38
    would go after that. So here comes finger
    of
  • 10:38 - 10:39
    blame.
  • 10:39 - 10:46
    Cucumber! Capybara. Poltergeist. Definitely
    part of the problem. I've
  • 10:46 - 10:49
    talked to enough other teams that use these
    tools
  • 10:49 - 10:52
    extensively, and seen the evidence from our audience,
    to
  • 10:52 - 10:54
    know that the results are just not as deterministic
  • 10:54 - 10:57
    as we want. When you're using multiple threads
    and
  • 10:57 - 11:01
    you're asserting against a browser environment,
    you're gonna have
  • 11:01 - 11:02
    some issues, right.
  • 11:02 - 11:07
    And one of those is browser environment, right.
    Browser
  • 11:07 - 11:10
    environment is a euphemism for, like, a complicated
    piece
  • 11:10 - 11:13
    of software that itself is a playground for
    network
  • 11:13 - 11:17
    latency issues and rendering hiccups and a
    callback soup.
  • 11:17 - 11:19
    So your tests have to be written in a
  • 11:19 - 11:21
    very specific way to prevent all the threads
    and
  • 11:21 - 11:23
    all the different layers of code from getting
    confused
  • 11:23 - 11:25
    and smashing into each other.
  • 11:25 - 11:27
    You know, some of you maybe are lucky, and
  • 11:27 - 11:29
    you use the right style most of the time
  • 11:29 - 11:32
    by default. Maybe you don't see that many
    problems.
  • 11:32 - 11:35
    A few things you gotta never assume.
  • 11:35 - 11:38
    Never assume the page has loaded. Never assume
    the
  • 11:38 - 11:42
    markup you are asserting against exists. Never
    assume your
  • 11:42 - 11:46
    AJAX request actually finished, and never
    assume the speed
  • 11:46 - 11:49
    at which things happen, because until you
    bolt it
  • 11:49 - 11:52
    down, you just don't know.
  • 11:52 - 11:57
    So, always make sure the markup exists before
    you
  • 11:57 - 12:00
    assert against it. New Capybara is supposed
    to be
  • 12:00 - 12:03
    better at this, and it, it's improved. But
    I
  • 12:03 - 12:06
    do not trust them. I am super paranoid about
  • 12:06 - 12:09
    this stuff. This is a good example of a
  • 12:09 - 12:12
    lurking rando, due to a race condition, in
    your
  • 12:12 - 12:13
    browser.
  • 12:13 - 12:14
    Capybara is supposed to wait for the page
    to
  • 12:14 - 12:17
    load before it continues after the visit method,
    but
  • 12:17 - 12:20
    I find it has sort of medium success with
  • 12:20 - 12:24
    doing that. Bolt it down, right. We used to
  • 12:24 - 12:26
    have something called the wait_until block,
    and that would
  • 12:26 - 12:28
    stop execution until a condition was met.
    And that
  • 12:28 - 12:31
    was great. Cause it replaced, like, sleep
    statements, which
  • 12:31 - 12:33
    is what we used before that.
  • 12:33 - 12:37
    Modern Capybara, no more wait_until block.
    It's inside the
  • 12:37 - 12:42
    have_css and have_content matchers. So, always
    assert that something
  • 12:42 - 12:46
    exists before you try to do anything with
    it.
  • 12:46 - 12:49
    And sometimes it might take a long time. The
  • 12:49 - 12:52
    default timeout for that, for those Capybara
    assertions, is
  • 12:52 - 12:56
    like five seconds. And sometimes, you need
    twenty seconds.
  • 12:56 - 12:58
    Usually, for us, that's because we're doing
    like a
  • 12:58 - 13:01
    file upload or another lengthy operation.
    But, again, never
  • 13:01 - 13:04
    assume that things are gonna take a normal
    amount
  • 13:04 - 13:07
    of time.
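
    (For reference, a minimal sketch of this pattern with Capybara's
    RSpec matchers; the content strings and the 20-second wait are
    illustrative:)

      # Bolt it down: assert the markup exists before interacting with
      # it. Capybara retries have_content until it appears or the wait
      # expires.
      expect(page).to have_content("Profile saved")

      # For a lengthy operation like a file upload, raise the wait for
      # this one assertion instead of sleeping.
      expect(page).to have_content("Upload complete", wait: 20)
      click_link "Continue"
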
  • 13:07 - 13:10
    Race conditions. I would be out of line to
  • 13:10 - 13:13
    give this talk without talking explicitly
    about race conditions,
  • 13:13 - 13:16
    right. Whenever you create a situation
    where a
  • 13:16 - 13:20
    sequence of key events doesn't happen in a
    predetermined
  • 13:20 - 13:25
    order, you've got a potential race condition.
  • 13:25 - 13:28
    So the winner of the race is random. And
  • 13:28 - 13:32
    that can create random outcomes in your test
    suite.
  • 13:32 - 13:36
    So what's an example of one of those? AJAX.
  • 13:36 - 13:43
    Right? In AJAX, your JavaScript running in
    Firefox may
  • 13:43 - 13:46
    or may not complete its AJAX call and render
  • 13:46 - 13:50
    the response before the test thread makes
    its assertions.
  • 13:50 - 13:54
    Now, Capybara tries to fix this by retrying
    the
  • 13:54 - 13:58
    assertions. But that doesn't always work.
    So, say you're
  • 13:58 - 14:01
    clicking a button to submit a form, and then
  • 14:01 - 14:05
    you're going to another page or refreshing
    the page.
  • 14:05 - 14:08
    This might cut off that post request, whether
    it's
  • 14:08 - 14:10
    from a regular form or an AJAX form, but
  • 14:10 - 14:13
    especially if it's an AJAX request. As soon
    as
  • 14:13 - 14:16
    you say, visit, all the outstanding AJAX requests
    cancel
  • 14:16 - 14:18
    in your browser.
  • 14:18 - 14:21
    So, you can fix this by adding an explicit
  • 14:21 - 14:24
    wait into your Cucumber step, right. When
    you need
  • 14:24 - 14:27
    to rig the race, jQuery provides this handy
    counter,
  • 14:27 - 14:31
    dollar dot active. That's all the XHR requests
    that
  • 14:31 - 14:33
    are outstanding. So, it's really not hard
    to keep
  • 14:33 - 14:37
    an eye on what's going on.
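
    (A sketch of that kind of explicit wait as a Cucumber step, assuming
    jQuery is on the page; the step name and polling interval are
    illustrative:)

      # features/step_definitions/ajax_steps.rb
      require "timeout"

      # Rig the race: block until jQuery reports no outstanding XHR
      # requests, or give up after Capybara's wait time.
      When(/^I wait for AJAX to finish$/) do
        Timeout.timeout(Capybara.default_wait_time) do
          sleep 0.1 until page.evaluate_script("jQuery.active").zero?
        end
      end
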
  • 14:37 - 14:41
    Here's another offender. Creating database
    objects from within the
  • 14:41 - 14:45
    test thread, right. What's wrong with this
    approach? Now,
  • 14:45 - 14:49
    if you're using MySQL, maybe nothing's wrong
    with this,
  • 14:49 - 14:54
    right. And that's because MySQL has the transaction
    hygiene
  • 14:54 - 15:00
    of a roadside diner, right. There's no separation.
    If
  • 15:00 - 15:02
    you're using Postgres, which we are, it has
    stricter
  • 15:02 - 15:05
    rules about the transactions. And this can
    create a
  • 15:05 - 15:07
    world of pain.
  • 15:07 - 15:10
    So, the test code and the Rails server are
  • 15:10 - 15:13
    running in different threads. And this effectively
    means different
  • 15:13 - 15:17
    database connections, and that means different
    transaction states. Now
  • 15:17 - 15:21
    there is some shared database connection code
    out there.
  • 15:21 - 15:24
    And I've had sort of mixed results with it.
  • 15:24 - 15:27
    I've heard this thing, right, about shared
    mutable resources
  • 15:27 - 15:32
    between threads being problematic. Like, they
    are. So let's
  • 15:32 - 15:35
    say you're lucky, and both threads are in
    the
  • 15:35 - 15:39
    same database transaction. Both the test thread
    and the
  • 15:39 - 15:43
    server thread are issuing check points and
    rollbacks against
  • 15:43 - 15:48
    the same connection. So sometimes one thread
    will reset
  • 15:48 - 15:50
    to a checkpoint after the other thread has
    already
  • 15:50 - 15:54
    rolled back the entire transaction. Right?
    And that's how
  • 15:54 - 15:56
    you get a rando.
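
    (The shared database connection code he's referring to is usually a
    monkeypatch along these lines, shown here only to make the shared
    state concrete, not as a recommendation:)

      # Force every thread to reuse one ActiveRecord connection, so the
      # test thread and the server thread share a transaction. This is
      # the connection both threads issue checkpoints and rollbacks on.
      class ActiveRecord::Base
        mattr_accessor :shared_connection
        @@shared_connection = nil

        def self.connection
          @@shared_connection || retrieve_connection
        end
      end
      ActiveRecord::Base.shared_connection = ActiveRecord::Base.connection
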
  • 15:56 - 15:59
    So, you want to create some state within your
  • 15:59 - 16:01
    application to run your test against, but
    you can't
  • 16:01 - 16:04
    trust the test thread and the server thread
    to
  • 16:04 - 16:07
    read the same database state, right. What
    do you
  • 16:07 - 16:10
    do?
  • 16:10 - 16:12
    So in our project, we use a single set
  • 16:12 - 16:14
    of fixture data that's fixed at the beginning
    of
  • 16:14 - 16:17
    the test run. And, essentially, the server
    thread, or
  • 16:17 - 16:20
    the test thread, sorry, treats the database
    as immutable.
  • 16:20 - 16:23
    It is read only, and any kind of verification
  • 16:23 - 16:26
    of changes has to happen via the browser.
  • 16:26 - 16:29
    So, we do this using RyanD's fixture_builder
    gem, to
  • 16:29 - 16:34
    combine the maintainable characteristics of
    factoried objects with the,
  • 16:34 - 16:38
    like, set it and forget it simplicity of fixtures.
  • 16:38 - 16:41
    So, any state that needs to exist across multiple
  • 16:41 - 16:43
    tests is stored in a set of fixtures, and
  • 16:43 - 16:45
    those are used throughout the test suite.
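
    (A minimal fixture_builder setup, roughly what that looks like; the
    factory and fixture names are illustrative:)

      # spec/support/fixture_builder.rb
      FixtureBuilder.configure do |fbuilder|
        # Rebuild the fixture set when the factories or this file change.
        fbuilder.files_to_check += Dir["spec/factories/*.rb",
                                       "spec/support/fixture_builder.rb"]

        fbuilder.factory do
          # Build the shared, read-only state once; tests refer to the
          # generated fixtures by name.
          user = FactoryGirl.create(:user, first_name: "Pat", state: "CA")
          fbuilder.name(:default_user, user)
        end
      end
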
  • 16:45 - 16:51
    And this is great, except it's also terrible.
    Unfortunately,
  • 16:51 - 16:56
    our fixture_builder definition file is like
    900 lines long.
  • 16:56 - 16:59
    And it's as dense as a Master's thesis, right.
  • 16:59 - 17:02
    It takes about two minutes to rebuild the
    fixture
  • 17:02 - 17:03
    set. And this happens when we rebundle, change
    the
  • 17:03 - 17:07
    factories, change the schema. Fortunately,
    that only happens a
  • 17:07 - 17:10
    couple of times a day, right. So mostly we're
  • 17:10 - 17:12
    saving time with it. But seriously? Two minutes
    as
  • 17:12 - 17:15
    your overhead to run one test is brutal.
  • 17:15 - 17:17
    So, at our stage, we think the right solution
  • 17:17 - 17:21
    is to use fixture_builder sparingly, right.
    Use it for
  • 17:21 - 17:24
    Cucumber tests, because they need an immutable
    database. And
  • 17:24 - 17:28
    maybe use it for core shared models for Rspec,
  • 17:28 - 17:30
    but whatever you do, do not create like a
  • 17:30 - 17:34
    DC Comics multiverse in your fixture setup
    file, with
  • 17:34 - 17:37
    like different versions for everything, because
    that leads to
  • 17:37 - 17:38
    pain.
  • 17:38 - 17:41
    Another thing you want to do is Mutex it.
  • 17:41 - 17:44
    So, a key technique we've used to prevent
    database
  • 17:44 - 17:47
    collisions is to put a Mutex on access to
  • 17:47 - 17:51
    the database. And this is crazy, but, you
    know,
  • 17:51 - 17:54
    an app running in the browser can make more
  • 17:54 - 17:56
    than one connection to the server at once
    over
  • 17:56 - 17:59
    AJAX. And that's a great place to breed race
  • 17:59 - 18:00
    conditions.
  • 18:00 - 18:04
    So, unless you have a Mutex, to ensure the
  • 18:04 - 18:07
    server only responds to one request at a time,
  • 18:07 - 18:09
    you don't necessarily know the order in which
    things
  • 18:09 - 18:10
    are gonna happen, and that means you're gonna
    get
  • 18:10 - 18:11
    unreproducible failures.
  • 18:11 - 18:14
    In effect, we use a Mutex to rig, rig
  • 18:14 - 18:16
    the race. You can check it out on GitHub.
  • 18:16 - 18:18
    It's just a sketch of the code we're using.
  • 18:18 - 18:24
    It's on omadahealth slash capybara_sync.
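
    (The idea, reduced to a sketch: a Rack middleware that holds one lock
    per request in the test environment. This is not the actual
    capybara_sync code, just the shape of it:)

      # Serve one request at a time under test, so overlapping AJAX
      # calls can't interleave their database work.
      class SynchronizedServer
        MUTEX = Mutex.new

        def initialize(app)
          @app = app
        end

        def call(env)
          MUTEX.synchronize { @app.call(env) }
        end
      end

      # config/environments/test.rb (hypothetical wiring)
      # config.middleware.insert_before 0, SynchronizedServer
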
  • 18:24 - 18:31
    Faker. Some of the randomness in our test
    suite
  • 18:31 - 18:33
    was due to inputs that we gave it. Our
  • 18:33 - 18:38
    code depends on factories. And the factories
    used randomly
  • 18:38 - 18:42
    generated fake data to fill in names, zip
    codes,
  • 18:42 - 18:44
    all the text fields. And there are good reasons
  • 18:44 - 18:45
    to use random data.
  • 18:45 - 18:49
    It regularly exercises your edge cases. Engineers
    don't have
  • 18:49 - 18:52
    to think of all possible first names you could
  • 18:52 - 18:54
    use. The code should work the same regardless
    of
  • 18:54 - 18:58
    what zip code someone is in. But sometimes
    it
  • 18:58 - 19:00
    doesn't.
  • 19:00 - 19:04
    For example, did you know that Faker includes
    Guam
  • 19:04 - 19:06
    and Puerto Rico in the states that it might
  • 19:06 - 19:09
    generate for someone? And we didn't include
    those in
  • 19:09 - 19:13
    our states dropdown. So when a Cucumber test
    edits
  • 19:13 - 19:15
    an account for a user that Faker placed in
  • 19:15 - 19:18
    Guam, their state is not entered when
    you
  • 19:18 - 19:20
    try to click save. And that leads to a
  • 19:20 - 19:23
    validation failure, and that leads to Cucumber
    not seeing
  • 19:23 - 19:26
    the expected results, and a test run
    from
  • 19:26 - 19:33
    a new factory will not reproduce that failure,
    right.
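
    (One way to keep the benefits of random data while pinning the state
    field to values the form's dropdown actually offers; not necessarily
    what they did, and the constant and factory names are illustrative:)

      # spec/factories/users.rb
      # Faker still randomizes names and zip codes, but the state is
      # drawn from the list the account form supports.
      SUPPORTED_STATES = %w[AL AK AZ AR CA CO CT].freeze # ...and so on

      FactoryGirl.define do
        factory :user do
          first_name { Faker::Name.first_name }
          zip_code   { Faker::Address.zip_code }
          state      { SUPPORTED_STATES.sample }
        end
      end
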
  • 19:37 - 19:40
    Something crazy happened. Here we go.
  • 19:40 - 19:46
    Times and dates. Oh, we're out of sync. Let
  • 19:46 - 19:53
    me just. Momentary technical difficulties.
    Mhmm.
  • 20:02 - 20:09
    Cool.
  • 20:09 - 20:16
    OK. Times and dates. Another subtle input
    to your
  • 20:19 - 20:24
    code is the current time. Our app sets itself
  • 20:24 - 20:27
    up to be on the user's time zone, to
  • 20:27 - 20:30
    prevent time-dependent data, like which week
    of our program
  • 20:30 - 20:33
    you are on, from changing in the middle of Saturday night.
  • 20:33 - 20:35
    And this was policy. We all knew about this.
  • 20:35 - 20:38
    We always used zone-aware time calls.
  • 20:38 - 20:41
    Except that we didn't. Like, when I audited
    it,
  • 20:41 - 20:43
    I found over a hundred places where we neglected
  • 20:43 - 20:47
    to use zone-aware time calls. So most of these
  • 20:47 - 20:51
    are fine. There's usually nothing wrong with
    epoch seconds.
  • 20:51 - 20:54
    But it only takes one misplaced call to time
  • 20:54 - 20:57
    dot now to create a failure. It's really best
  • 20:57 - 20:59
    to just forget about time dot now. Search
    your
  • 20:59 - 21:02
    code base for it and eliminate it. Always
    use
  • 21:02 - 21:04
    time dot zone dot now. Same thing for date
  • 21:04 - 21:07
    dot today. That's time zone dependent. You
    want to
  • 21:07 - 21:09
    use time dot zone dot today.
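
    (The rule of thumb, in code; the zone name is illustrative:)

      Time.now         # machine-local time zone -- a lurking rando
      Date.today       # also depends on the machine's zone
      Time.zone.now    # respects the application / user time zone
      Time.zone.today  # zone-aware date

      # With the app pinned to the user's zone, the two can disagree:
      Time.zone = "Hawaii"
      Time.zone.today  # may differ from Date.today on a CI box in UTC
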
  • 21:09 - 21:12
    Unsurprisingly, I found a bunch of this class
    of
  • 21:12 - 21:15
    failure when I was at RubyConf in Miami. So
  • 21:15 - 21:19
    these methods create random failures. Because
    your database objects
  • 21:19 - 21:21
    can be in a different time zone than your
  • 21:21 - 21:26
    machine's local time zone.
  • 21:26 - 21:29
    External dependencies. Any time you depend
    on a third
  • 21:29 - 21:32
    party service in your test, you introduce
    a possible
  • 21:32 - 21:37
    random element, right. S3, Google Analytics,
    Facebook. Any of
  • 21:37 - 21:39
    these things can go down. They can be slow.
  • 21:39 - 21:43
    They can be broken. Additionally, they all
    depend on
  • 21:43 - 21:46
    the quality of your local internet connection.
  • 21:46 - 21:50
    So, I'm gonna suggest that if you are affected
  • 21:50 - 21:55
    by random failures, it's important to reproduce
    the failure.
  • 21:55 - 21:56
    It is possible. It is possible. It is not
  • 21:56 - 21:59
    only possible. It is critical. And any problem
    that
  • 21:59 - 22:04
    you can reproduce, reliably, can be solved.
    Well, at
  • 22:04 - 22:06
    least, if you can reproduce it, you have a
  • 22:06 - 22:10
    heck of a lot better chance of solving it.
  • 22:10 - 22:14
    So, you have to bolt it all down. How
  • 22:14 - 22:17
    do you fix the data? When you're trying to
  • 22:17 - 22:19
    reproduce a random failure, you're gonna need
    the same
  • 22:19 - 22:22
    database objects used by the failing test.
    So if
  • 22:22 - 22:24
    you used factories, and there's not a file
    system
  • 22:24 - 22:27
    record when a test starts to fail randomly,
    you're
  • 22:27 - 22:29
    gonna want to document the database state
    at the
  • 22:29 - 22:30
    time of failure.
  • 22:30 - 22:33
    And that's gonna mean yml fixtures or, like,
    an
  • 22:33 - 22:36
    SQL dump, or something else clever. You have
    to
  • 22:36 - 22:38
    find a way to re-establish that same state
    that
  • 22:38 - 22:42
    was created at the moment that you had the
  • 22:42 - 22:45
    failure. And the network. Great talk before
    about how
  • 22:45 - 22:48
    to nail down the network. API calls and responses
  • 22:48 - 22:52
    are input for your code. WebMock, VCR, other
    libraries
  • 22:52 - 22:56
    exist to replay third party service responses.
  • 22:56 - 22:58
    So, if you're trying to reproduce a failure
    in
  • 22:58 - 23:01
    a test that has any third party dependencies,
    you're
  • 23:01 - 23:03
    gonna wanna use a library to capture and replay
  • 23:03 - 23:06
    those responses.
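
    (A minimal capture-and-replay setup with VCR backed by WebMock;
    paths and cassette tags are illustrative:)

      # spec/support/vcr.rb
      require "vcr"

      VCR.configure do |c|
        c.cassette_library_dir = "spec/cassettes"
        c.hook_into :webmock
        c.configure_rspec_metadata!  # lets you tag examples with :vcr
      end

      # The first run records the third-party response; later runs
      # replay it, so the test no longer depends on the network.
      describe "S3 uploads", :vcr do
        it "stores the document" do
          # ...
        end
      end
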
  • 23:06 - 23:10
    Also, share buttons, right. In your Cucumber
    tests, you're
  • 23:10 - 23:13
    gonna wanna remove the calls to Google Analytics,
    Facebook
  • 23:13 - 23:16
    like buttons, all that stuff from the browser.
    These
  • 23:16 - 23:18
    slow down your page load time, and they create
  • 23:18 - 23:22
    unnecessary failures because of that.
  • 23:22 - 23:24
    But, if you're replaying all your network
    calls, how
  • 23:24 - 23:28
    do you know the external API hasn't changed,
    right?
  • 23:28 - 23:29
    You want to test the services that your code
  • 23:29 - 23:31
    depends on, too. So you need a build that
  • 23:31 - 23:35
    does that. But it shouldn't be the main build.
  • 23:35 - 23:36
    Purpose of the main build is to let the
  • 23:36 - 23:40
    team know when their code is broken, and only when
    their
  • 23:40 - 23:43
    code is broken. And it should do that as
  • 23:43 - 23:45
    quickly as possible.
  • 23:45 - 23:47
    And then we have a separate, external build
    that
  • 23:47 - 23:51
    tests the interactions with third party services.
    So, essentially,
  • 23:51 - 23:53
    external communication is off and then on,
    and we
  • 23:53 - 23:54
    check build results for both.
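
    (One way to flip that switch between the two builds; the environment
    variable name is illustrative:)

      # spec/support/webmock.rb
      require "webmock/rspec"

      if ENV["EXTERNAL_BUILD"]
        # The separate build: really talk to S3, Facebook, and friends.
        WebMock.allow_net_connect!
      else
        # The main build: fail fast on any unexpected outside call.
        WebMock.disable_net_connect!(allow_localhost: true)
      end
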
  • 23:54 - 23:58
    So, I want to talk about another reason that
  • 23:58 - 24:01
    tests fail randomly. Rspec runs all your tests
    in
  • 24:01 - 24:06
    a random order every time. And obviously this
    introduces
  • 24:06 - 24:09
    randomness. But, there is a reason for that,
    and
  • 24:09 - 24:11
    the reason is to help you stay on top
  • 24:11 - 24:14
    of test pollution.
  • 24:14 - 24:16
    Test pollution is when state that is changed
    in
  • 24:16 - 24:20
    one test persists and influences the results
    of other
  • 24:20 - 24:23
    tests. Changed state can live in process memory,
    in
  • 24:23 - 24:26
    a database, on the file system, in an external
  • 24:26 - 24:29
    service. Right. Lots of places.
  • 24:29 - 24:33
    Sometimes, the polluted state causes the subsequent
    test to
  • 24:33 - 24:37
    fail incorrectly. And sometimes it causes
    the subsequent test
  • 24:37 - 24:41
    to pass incorrectly. And this was such a rampant
  • 24:41 - 24:44
    issue in the early days of Rspec that the
  • 24:44 - 24:46
    Rspec team made running the tests in a random
  • 24:46 - 24:50
    order the default as of Rspec 2. So, thank
  • 24:50 - 24:51
    you Rspec.
  • 24:51 - 24:56
    Now, any test pollution issues should stand
    out. But
  • 24:56 - 24:59
    what do you think happens if you ignore random
  • 24:59 - 25:05
    test failures for like a year or so? Yeah.
  • 25:05 - 25:08
    Here's some clues that your issue might be
    test
  • 25:08 - 25:10
    pollution, right.
  • 25:10 - 25:14
    With test pollution, the affected tests never
    fail when
  • 25:14 - 25:18
    they're run in isolation. Not ever. And rather
    than
  • 25:18 - 25:22
    throwing an unexpected exception, a test pollution
    failure usually
  • 25:22 - 25:25
    takes the form of returning different data
    than what
  • 25:25 - 25:26
    you expected.
  • 25:26 - 25:29
    And finally, the biggest clue that you might
    have
  • 25:29 - 25:32
    a test pollution issue is that you haven't
    really
  • 25:32 - 25:37
    been checking for test pollution. So, we gotta
    reproduce
  • 25:37 - 25:39
    test pollution issues. Which means we have
    to run
  • 25:39 - 25:41
    the tests suite in the same order, and we
  • 25:41 - 25:44
    have to use the fixture or database data and
  • 25:44 - 25:47
    the network data from the failed build.
  • 25:47 - 25:50
    So, first you have to identify the random
    seed.
  • 25:50 - 25:52
    Maybe you've seen this cryptic line at the
    end
  • 25:52 - 25:56
    of your Rspec test output. This is not completely
  • 25:56 - 26:00
    meaningless. 22164 is your magic key to rerun
    the
  • 26:00 - 26:02
    test in the same order as the build that
  • 26:02 - 26:05
    just ran. So you want to modify your dot
  • 26:05 - 26:09
    Rspec file to include the seed value. Be sure
  • 26:09 - 26:12
    to change the format to documentation
    as
  • 26:12 - 26:13
    well as adding the seed. That will make it
  • 26:13 - 26:16
    more readable, for you, so that you can start
  • 26:16 - 26:17
    to think about the order that things are running
  • 26:17 - 26:20
    in and what could possibly be causing your
    pollution
  • 26:20 - 26:21
    problem.
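
    (With the seed from the output above, the contents of the .rspec
    file end up looking like this:)

      --seed 22164
      --format documentation
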
  • 26:21 - 26:25
    So, the problem with test pollution is fundamentally
    about
  • 26:25 - 26:28
    incorrectly persisted state, so that the state
    that's being
  • 26:28 - 26:32
    persisted is important. You want to ensure
    that the
  • 26:32 - 26:35
    data is identical to the failed build. And
    there's
  • 26:35 - 26:39
    lots of ways to do this.
  • 26:39 - 26:42
    So you've got your random seed. You've got
    your
  • 26:42 - 26:44
    data from the failed build, and then you rerun
  • 26:44 - 26:47
    the specs. And if you see the failure repeated,
  • 26:47 - 26:50
    you should celebrate, right. You've correctly
    diagnosed that the
  • 26:50 - 26:52
    issue is test pollution and you are on your
  • 26:52 - 26:53
    way to fixing it.
  • 26:53 - 26:56
    And if you don't see the failure, maybe it's
  • 26:56 - 26:59
    not test pollution. Maybe there's another
    aspect of your
  • 26:59 - 27:03
    build environment that needs to be duplicated,
    right. But
  • 27:03 - 27:08
    even then, say you've reproduced the problem.
    Now what?
  • 27:08 - 27:11
    You still have to diagnose what is causing
    the
  • 27:11 - 27:13
    pollution. You know that running the tests
    in a
  • 27:13 - 27:17
    particular order creates a failure. The problem
    with test
  • 27:17 - 27:20
    pollution is that there is a non-obvious connection
    between
  • 27:20 - 27:23
    where the problem appears in the failed test
    and
  • 27:23 - 27:26
    its source in another test case.
  • 27:26 - 27:28
    And you can find out about the failure using
  • 27:28 - 27:32
    print statements or debugger, using whatever
    tools you want.
  • 27:32 - 27:35
    But, maybe you get lucky and you are able
  • 27:35 - 27:37
    to just figure it out. But in a complex
  • 27:37 - 27:41
    code base with thousands of tests the source
    of
  • 27:41 - 27:45
    the pollution can be tricky to track down.
  • 27:45 - 27:49
    So, just running through the suite to reproduce
    the
  • 27:49 - 27:52
    failure might take ten minutes. And this is
    actually
  • 27:52 - 27:56
    terrible, right. Waiting ten minutes for feedback?
    This is
  • 27:56 - 28:01
    a source of cognitive depletion. All of the
    stack
  • 28:01 - 28:03
    you've built up in your brain to solve this
  • 28:03 - 28:06
    problem is disintegrating over that ten minutes.
    You're gonna
  • 28:06 - 28:09
    work on other problems. You're gonna check
    Facebook while
  • 28:09 - 28:11
    those tests are running. And you're gonna
    lose your
  • 28:11 - 28:16
    focus, right. And that is, essentially, how
    rando wins.
  • 28:16 - 28:21
    Fortunately, we can discard large amounts
    of complexity and
  • 28:21 - 28:23
    noise, by using a stupid process that we don't
  • 28:23 - 28:29
    have to think about. Binary search. In code,
    debugging
  • 28:29 - 28:32
    via binary search is a process of repeatedly
    dividing
  • 28:32 - 28:36
    the search space in half, until you locate
    the
  • 28:36 - 28:41
    smallest coherent unit that exhibits the desired
    behavior.
  • 28:41 - 28:45
    OK. So we have the output of a set
  • 28:45 - 28:48
    of specs that we ran in documentation mode.
    This
  • 28:48 - 28:50
    is sort of a high level overview that you
  • 28:50 - 28:52
    might see in Sublime, right. And in the middle
  • 28:52 - 28:56
    here, this red spot is where the failure occurs.
  • 28:56 - 29:00
    So we know the cause has to happen before
  • 29:00 - 29:05
    the failure, because causality. So in the
    green block,
  • 29:05 - 29:09
    at the top, is, that's the candidate block,
    or
  • 29:09 - 29:11
    the search space.
  • 29:11 - 29:17
    So, practically, we split the search space
    in half,
  • 29:17 - 29:20
    and remove half of it. And if the failure
  • 29:20 - 29:23
    reoccurs when we rerun with this configuration,
    we know
  • 29:23 - 29:28
    that the cause is in that remaining block,
    right.
  • 29:28 - 29:31
    But sometimes you've got more problems than
    you know.
  • 29:31 - 29:32
    So it's good to test the other half of
  • 29:32 - 29:36
    the search space as well.
  • 29:36 - 29:38
    So if your failure appeared in step zero, you
    you
  • 29:38 - 29:41
    expect not to see the failure here. If you
  • 29:41 - 29:45
    also see the failure here, you might have
    multiple
  • 29:45 - 29:49
    sources of test pollution or, more likely,
    test pollution
  • 29:49 - 29:51
    isn't really your problem, and the problem
    is actually
  • 29:51 - 29:55
    outside of the search space.
  • 29:55 - 29:59
    So here's a hiccup. Binary search requires
    us to
  • 29:59 - 30:01
    remove large segments of the test suite to
    narrow
  • 30:01 - 30:04
    in on the test that causes the pollution.
    And
  • 30:04 - 30:07
    this creates a problem, because random ordering
    in the
  • 30:07 - 30:11
    test suite changes when you remove tests.
    Completely. Remove
  • 30:11 - 30:14
    one test, the whole thing reshuffles on the
    same
  • 30:14 - 30:18
    seed. So there's no way to effectively perform
    a
  • 30:18 - 30:21
    binary search using a random seed.
  • 30:21 - 30:24
    So here's the good news. It is possible to
  • 30:24 - 30:27
    manually declare the ordering of your Rspec
    tests, using
  • 30:27 - 30:34
    this undocumented configuration option, order_examples.
    So, config dot order_examples
  • 30:35 - 30:38
    takes a block, and that'll get the whole collection
  • 30:38 - 30:41
    of Rspec examples after Rspec has loaded the
    specs
  • 30:41 - 30:43
    to be run. And then you just reorder the
  • 30:43 - 30:45
    examples in whatever order you want them to
    be
  • 30:45 - 30:49
    ordered in and return that set from the block.
  • 30:49 - 30:51
    So, that sounds simple.
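
    (A sketch of that, assuming a file with one full example description
    per line recorded from the failed build; the file name and matching
    scheme are illustrative, and this is the RSpec 2-era undocumented
    API the talk describes:)

      # spec/spec_helper.rb
      desired_order = File.readlines("tmp/replay_order.txt").map(&:strip)

      RSpec.configure do |config|
        config.order_examples do |examples|
          # Put the loaded examples back in the recorded order; anything
          # not listed sorts to the end.
          examples.sort_by do |example|
            desired_order.index(example.full_description) ||
              desired_order.length
          end
        end
      end
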
  • 30:51 - 30:57
    I, I made a little proto-gem for this. It's
  • 30:57 - 31:01
    called rspec_manual_order, and basically it
    takes the output of
  • 31:01 - 31:04
    the documentation format from the test that
    you ran
  • 31:04 - 31:09
    earlier, and turns that into an ordering list.
    So,
  • 31:09 - 31:11
    if you, if you log the output of, of
  • 31:11 - 31:14
    your Rspec suite with the failure to a file,
  • 31:14 - 31:16
    you'll be able to replay it using rspec_manual_order,
    and
  • 31:16 - 31:19
    you can check that out on GitHub.
  • 31:19 - 31:25
    So it's possible to reduce the search space
    and
  • 31:25 - 31:27
    do a binary search on Rspec. And once you've
  • 31:27 - 31:30
    reduced the search space to a single spec
    or
  • 31:30 - 31:32
    a suite of examples that all cause the problem,
  • 31:32 - 31:36
    you put your monkey in the position to shine
  • 31:36 - 31:38
    against your test pollution issue, right.
    This is where
  • 31:38 - 31:41
    it actually becomes possible to figure it
    out by
  • 31:41 - 31:44
    looking at the context.
  • 31:44 - 31:46
    I've gone in depth into test pollution, because
    it's
  • 31:46 - 31:51
    amenable to investigation using simple techniques,
    right. Binary search
  • 31:51 - 31:54
    and reproducing the failure state are key
    debugging skills
  • 31:54 - 31:57
    that you will improve with practice. When
    I started
  • 31:57 - 31:59
    looking into our random failures, I didn't
    know we
  • 31:59 - 32:03
    had test pollution issues. Turned out we weren't
    resetting
  • 32:03 - 32:06
    the global time zone correctly between tests.
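
    (That particular fix amounts to restoring the zone around every
    example; a minimal guard, with the hook placement illustrative:)

      # spec/spec_helper.rb
      RSpec.configure do |config|
        config.around(:each) do |example|
          original_zone = Time.zone
          begin
            example.run
          ensure
            # Undo any Time.zone= the example or app code performed.
            Time.zone = original_zone
          end
        end
      end
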
  • 32:06 - 32:08
    This was far from the only problem I found.
  • 32:08 - 32:11
    But without fixing this one, our suite would
    never
  • 32:11 - 32:16
    be clean. So, every random failure that you
    are
  • 32:16 - 32:20
    chasing has its own unique story. There are
    some
  • 32:20 - 32:22
    in our code that we haven't figured out yet,
  • 32:22 - 32:24
    and there are some in your code that I
  • 32:24 - 32:27
    hope I never see.
  • 32:27 - 32:31
    The key to eliminating random test failures
    is don't
  • 32:31 - 32:34
    give up, right. Today we've covered things
    that go
  • 32:34 - 32:37
    wrong in Cucumber and Capybara. Things that
    go wrong
  • 32:37 - 32:40
    in Rspec and just general sources of randomness
    in
  • 32:40 - 32:42
    your test suite. And hopefully you're walking
    out of
  • 32:42 - 32:44
    here with at least one new technique to improve
  • 32:44 - 32:46
    the reliability of your tests.
  • 32:46 - 32:48
    We've been working with ours for about eight
    months,
  • 32:48 - 32:50
    and we're in a place where random failures
    occur
  • 32:50 - 32:54
    like, less than five percent of the time.
    And
  • 32:54 - 32:56
    we set up a tiered build system to run
  • 32:56 - 32:59
    the tests sequentially when the fast parallel
    build fails.
  • 32:59 - 33:02
    So, the important thing is that when new random
  • 33:02 - 33:05
    failures occur, we reliably assign a team
    to hunt
  • 33:05 - 33:05
    them down.
  • 33:05 - 33:08
    And if you keep working on your build, eventually
  • 33:08 - 33:10
    you'll figure out a combination of tactics
    that will
  • 33:10 - 33:13
    lead to a stable, reliable test suite, that
    will
  • 33:13 - 33:16
    have the trust of your team. So thank you.