AUSTIN PUTMAN: This is the last session before happy hour. I appreciate all of you for hanging around this long. Maybe you're here because, I don't know, there's a bar on the first floor of this hotel. I think that is where the main track is currently taking place. I am Austin Putman. I am the VP of engineering for Omada Health. At Omada, we help people at risk of chronic disease, like diabetes, make crucial behavior changes and have longer, healthier lives. So, it's pretty awesome. I'm gonna start with some spoilers, because I want you to have an amazing RailsConf. So if this is not what you're looking for, don't be shy about finding that bar track. We're gonna spend some quality time with Capybara and Cucumber, whose flakiness is legendary, for very good reasons. Let me take your temperature. Can I see hands? How many people have had problems with random failures in Cucumber or Capybara? Yeah. Yeah. This is reality, folks. We're also gonna cover the ways that RSpec does and does not help us track down test pollution. How many folks out there have had a random failure problem in the RSpec suite, like in your models or your controller tests? OK, still a lot of people, right. It happens. But we don't talk about it. So in between, we're gonna review some problems that can dog any test suite. This is stuff like random data, time zone heck, external dependencies. All this leads to pain. There was a great talk earlier about external dependencies. Here's just a random one: how many people here have had a test fail due to a daylight saving time issue? Yeah. Ben Franklin, you are a menace. Let's talk about eliminating inconsistent failures in your tests, and on our team, we call that fighting randos. And I'm here to talk about this because I was stupid and short-sighted, and random failures caused us a lot of pain. I chose to try to hit deadlines instead of focusing on build quality, and our team paid a terrible price. Anybody out there paying that price? Anybody out there feel me on this? Yeah. It sucks. So let's do some science. Some projects seem to have more random failure problems than others. I want to gather some data. So first, if you write tests on a regular basis, raise your hand. Right? Wow. I love RailsConf. Keep your hand up if you believe you have experienced a random test failure. The whole room. Now, keep it up if you think you're likely to have one in the next, like, four weeks. Who's out there? It's still happening, right. You're in the middle of it. OK, so this is not hypothetical for this audience. This is a widespread problem. But I don't see a lot of people talking about it. And the truth is, while it's a great tool, a comprehensive integration suite is also a breeding ground for baffling Heisenbugs. So, to understand how test failures become a chronic productivity blocker, I want to talk a little bit about testing culture, right. Why is this even bad? So, we have an automated CI machine that runs our full test suite every time a commit is pushed. And every time the build passes, we push the new code to a staging environment for acceptance. Right, that's our process. How many people out there have a setup that's kind of like that? OK. Awesome. So a lot of people know what I'm talking about. So, in the fall of 2012, we started seeing occasional, unreproducible failures of the test suite in Jenkins. And we were pushing to get features out the door for January first. And we found that we could just rerun the build and the failure would go away.
And we got pretty good at spotting the two or three tests where this happened. So, we would check the output of a failed build, and if it was one of the suspect tests, we would just run the build again. Not a problem. Staging would deploy. We would continue our march towards the launch. But by the time spring rolled around, there were like seven or eight places causing problems regularly. And we would try to fix them, you know, we wouldn't ignore them. But the failures were unreliable, so it was hard to say if we had actually fixed anything. And eventually we just added a gem called Cucumber-Rerun. Yeah. And this just reruns the failed specs if there's a problem, and when it passes the second time, it's good. You're fine. No big deal. And then some people on our team got ambitious, and they said, we could make it faster. We could make CI faster with the parallel_tests gem, which is awesome. But Cucumber-Rerun and parallel_tests are not compatible. And so we had a test suite that ran three times faster, but failed twice as often. And as we came into the fall, we had our first Bad Jenkins week. On a fateful Tuesday at 4 PM, the build just stopped passing. And there were anywhere from like thirty to seventy failures. And some of them were our usual suspects, and dozens of them were, like, previously good tests. Tests we trusted. And none of them failed in isolation, right. And after like two days of working on this, we eventually got a clean RSpec build, but Cucumber would still fail. And the failures could not be reproduced on a dev machine, or even on the same CI machine, outside of the whole build running. So, over the weekend, somebody pushes a commit and we get a green build. And there's nothing special about this commit, right. Like, it was a comment change. And we had tried a million things, and no single change obviously led to the passing build. And the next week, we were back to, like, a fifteen percent failure rate. Like, pretty good. So, we could push stories to staging again, and we're still under the deadline pressure, right. So we shrugged. And we moved on, right. And maybe somebody wants to guess what happened next? Right? Yeah. It happened again, right. A whole week of just no tests pass. The build never passes. So we turned off parallel_tests, right. Because we couldn't even get a coherent log of which tests were causing errors. And then we started commenting out the really problematic tests, and there were still these seemingly innocuous specs that failed regularly but not consistently. So these are tests that have enough business value that we are very reluctant to just, like, delete them. And so we reinstated Cucumber-Rerun, and its buddy RSpec-Rerun. And this mostly worked, right. So we were making progress. But the build issues continued to show up in the negative column in our retrospectives. And that was because there were several problems with this situation, right. Like, reduced trust. When build failures happen four or five times a day, those aren't a red flag. Those are just how things go. And everyone on the team knows that the most likely explanation is a random failure. And the default response to a build failure becomes: run it again. So, just run it again, right. The build failed. Whatever. So then, occasionally, we break things for real. But we stopped noticing, because we started expecting CI to be broken. Sometimes other pairs would pull the code and they would see the legitimate failures.
Sometimes we thought we were having a Bad Jenkins week, and on the third or fourth day we realized we were having actual failures. This is pretty bad, right. So our system depends on green builds to mark the code that can be deployed to staging and production, and without green builds, stories can't get delivered and reviewed. So we stopped getting timely feedback. Meanwhile, the reviewer gets, like, a week's worth of stories. All at once. Big clump. And that means they have less time to pay attention to detail on each delivered feature. And that means that the product is a little bit crappier every week. So, maybe you need a bug fix. Fast. Forget about that. You've got, like, a twenty percent chance your bug fix build is gonna fail for no reason. Maybe the code has to ship, because the app is mega busted. In this case, we would rerun the failed tests on our local machine, and then cross our fingers and deploy. So, in effect, our policy was: if the code works on my machine, it can be deployed to production. So. At the most extreme, people lose faith in the build, and eventually they just forget about testing. And this didn't happen to us, but I had to explain to management that key features couldn't be shipped because of problems with the test server. And they wanted to know a lot more about the test server. And it was totally clear that while a working test server has their full support, an unreliable test server is a business liability and needs to be resolved. So, the test server is supposed to solve problems for us, and that is the only story that I like to tell about it. So, we began to fight back. And we personified the random failures. They became randos. A rando attack. A rando storm. And most memorably, Rando Backstabbian, intergalactic randomness villain. We had a pair working on the test suite full time for about three months trying to resolve these issues. We tried about a thousand things, and some of them worked. And I'm gonna pass along the answers we found, and the hypotheses we didn't disprove. Honestly, I'm hoping that you came to this talk because you've had similar problems and you found better solutions. So, this is just what we found. I have a very important tool for this section of the talk. It's the finger of blame. We used this a lot. We would be like, hey, could the problem be Cucumber? And then we would go after that. So here comes the finger of blame. Cucumber! Capybara. Poltergeist. Definitely part of the problem. I've talked to enough other teams that use these tools extensively, and seen the evidence from this audience, to know that the results are just not as deterministic as we want. When you're using multiple threads and you're asserting against a browser environment, you're gonna have some issues, right. And one of those issues is the browser environment itself, right. Browser environment is a euphemism for a complicated piece of software that is itself a playground for network latency issues and rendering hiccups and callback soup. So your tests have to be written in a very specific way to prevent all the threads and all the different layers of code from getting confused and smashing into each other. You know, some of you maybe are lucky, and you use the right style most of the time by default. Maybe you don't see that many problems.
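To make that concrete, here's a rough sketch of that "right style" in Capybara terms — the assert-before-you-act pattern he gets into next. The selectors here are made up for illustration; the point is that the `have_css` and `have_content` matchers make Capybara wait and retry, so the interaction doesn't race the page render.

```ruby
# A sketch of the "never assume the markup exists" style. The expectations
# below make Capybara poll until the markup shows up (or time out), so the
# click can't fire against a half-rendered page. Selectors are hypothetical.
expect(page).to have_css('#billing-form')
expect(page).to have_content('Billing address')
find('#billing-form .save-button').click
```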
A few things you gotta never assume. Never assume the page has loaded. Never assume the markup you are asserting against exists. Never assume your AJAX request actually finished, and never assume the speed at which things happen, because until you bolt it down, you just don't know. So, always make sure the markup exists before you assert against it. New Capybara is supposed to be better at this, and it has improved. But I do not trust them. I am super paranoid about this stuff. This is a good example of a lurking rando, due to a race condition in your browser. Capybara is supposed to wait for the page to load before it continues after the visit method, but I find it has sort of medium success with that. Bolt it down, right. We used to have something called the wait_until block, and that would stop execution until a condition was met. And that was great, 'cause it replaced sleep statements, which is what we used before that. Modern Capybara: no more wait_until block. That waiting now lives inside the have_css and have_content matchers. So, always assert that something exists before you try to do anything with it. And sometimes it might take a long time. The default timeout for those Capybara assertions is like five seconds. And sometimes you need twenty seconds. Usually, for us, that's because we're doing a file upload or another lengthy operation. But, again, never assume that things are gonna take a normal amount of time. Race conditions. I would be out of line to give this talk without talking explicitly about race conditions, right. Whenever you create a situation where a sequence of key events doesn't happen in a predetermined order, you've got a potential race condition. So the winner of the race is random. And that can create random outcomes in your test suite. So what's an example of one of those? AJAX. Right? With AJAX, your JavaScript running in Firefox may or may not complete its AJAX call and render the response before the test thread makes its assertions. Now, Capybara tries to fix this by retrying the assertions. But that doesn't always work. So, say you're clicking a button to submit a form, and then you're going to another page or refreshing the page. This might cut off that POST request, whether it's a regular form or an AJAX form, but especially if it's an AJAX request. As soon as you say visit, all the outstanding AJAX requests in your browser get cancelled. So, you can fix this by adding an explicit wait into your Cucumber step, right. When you need to rig the race, jQuery provides this handy counter, $.active. That's all the XHR requests that are outstanding. So, it's really not hard to keep an eye on what's going on. Here's another offender: creating database objects from within the test thread, right. What's wrong with this approach? Now, if you're using MySQL, maybe nothing's wrong with this, right. And that's because MySQL has the transaction hygiene of a roadside diner, right. There's no separation. If you're using Postgres, which we are, it has stricter rules about transactions. And this can create a world of pain. So, the test code and the Rails server are running in different threads. And that effectively means different database connections, and that means different transaction states. Now, there is some shared database connection code out there. And I've had sort of mixed results with it. I've heard this thing, right, about shared mutable resources between threads being problematic. Like, they are.
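Two sketches to pin down what's being described here. First, the explicit wait: a helper that polls that $.active counter so the test thread doesn't move on while XHRs are still in flight. This is a common pattern rather than the speaker's exact code, and it assumes jQuery is loaded on the page.

```ruby
require 'timeout'

# Hypothetical helper for a Cucumber/Capybara step: block until jQuery reports
# zero outstanding XHR requests, or give up after `timeout` seconds.
def wait_for_ajax(timeout = 10)
  Timeout.timeout(timeout) do
    sleep 0.1 until page.evaluate_script('jQuery.active').zero?
  end
end
```

Second, the "shared database connection code out there" is usually a variation on this widely circulated monkeypatch, which forces the test thread and the server thread onto one ActiveRecord connection. As he says, results are mixed — treat it as a sketch of the idea, not a recommendation.

```ruby
# Force every thread onto a single ActiveRecord connection, so the test thread
# and the server thread see the same transaction state.
class ActiveRecord::Base
  mattr_accessor :shared_connection
  @@shared_connection = nil

  def self.connection
    @@shared_connection || retrieve_connection
  end
end

ActiveRecord::Base.shared_connection = ActiveRecord::Base.connection
```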
So let's say you're lucky, and both threads are in the same database transaction. Both the test thread and the server thread are issuing checkpoints and rollbacks against the same connection. So sometimes one thread will reset to a checkpoint after the other thread has already rolled back the entire transaction. Right? And that's how you get a rando. So, you want to create some state within your application to run your test against, but you can't trust the test thread and the server thread to read the same database state, right. What do you do? So in our project, we use a single set of fixture data that's fixed at the beginning of the test run. And, essentially, the test thread treats the database as immutable. It is read-only, and any kind of verification of changes has to happen via the browser. So, we do this using RyanD's fixture_builder gem, to combine the maintainable characteristics of factoried objects with the set-it-and-forget-it simplicity of fixtures. So, any state that needs to exist across multiple tests is stored in a set of fixtures, and those are used throughout the test suite. And this is great, except it's also terrible. Unfortunately, our fixture_builder definition file is like 900 lines long. And it's as dense as a Master's thesis, right. It takes about two minutes to rebuild the fixture set. And that happens whenever we rebundle, change the factories, or change the schema. Fortunately, that's only a couple of times a day, right. So mostly we're saving time with it. But seriously? Two minutes of overhead to run one test is brutal. So, at our stage, we think the right solution is to use fixture_builder sparingly, right. Use it for Cucumber tests, because they need an immutable database. And maybe use it for core shared models in RSpec. But whatever you do, do not create a DC Comics multiverse in your fixture setup file, with different versions of everything, because that leads to pain. Another thing you want to do is Mutex it. So, a key technique we've used to prevent database collisions is to put a Mutex on access to the database. And this is crazy, but, you know, an app running in the browser can make more than one connection to the server at once over AJAX. And that's a great place to breed race conditions. So, unless you have a Mutex to ensure the server only responds to one request at a time, you don't necessarily know the order in which things are gonna happen, and that means you're gonna get unreproducible failures. In effect, we use a Mutex to rig the race. You can check it out on GitHub. It's just a sketch of the code we're using. It's on omadahealth slash capybara_sync.
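I haven't reproduced the actual capybara_sync code here, but the shape of the idea is roughly a Rack middleware that wraps the app under test in a Mutex, so overlapping requests get serialized. A minimal sketch, with hypothetical names and wiring:

```ruby
# Hypothetical middleware: serialize every request hitting the app under test,
# so concurrent AJAX calls can't interleave in a random order.
class SerializedApp
  LOCK = Mutex.new

  def initialize(app)
    @app = app
  end

  def call(env)
    LOCK.synchronize { @app.call(env) }
  end
end

# One possible wiring in the Capybara setup (an assumption, not their code):
Capybara.app = SerializedApp.new(Capybara.app)
```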
Faker. Some of the randomness in our test suite was due to the inputs that we gave it. Our code depends on factories. And the factories used randomly generated fake data to fill in names, zip codes, all the text fields. And there are good reasons to use random data. It regularly exercises your edge cases. Engineers don't have to think of all the possible first names someone could have. The code should work the same regardless of what zip code someone is in. But sometimes it doesn't. For example, did you know that Faker includes Guam and Puerto Rico in the states that it might generate for someone? And we didn't include those in our states dropdown. So when a Cucumber test edits an account for a user that Faker placed in Guam, their state is not entered when you try to click save. And that leads to a validation failure, and that leads to Cucumber not seeing the expected results, and a test run with data from a new factory will not reproduce that failure, right. Something crazy happened. Here we go. Times and dates. Oh, we're out of sync. Let me just... momentary technical difficulties. Mhmm. Cool. OK. Times and dates. Another subtle input to your code is the current time. Our app sets itself up to be on the user's time zone, to prevent time-dependent data, like which week of our program you are on, from changing in the middle of Saturday night. And this was policy. We all knew about this. We always used zone-aware time calls. Except that we didn't. Like, when I audited it, I found over a hundred places where we neglected to use zone-aware time calls. Most of these are fine. There's usually nothing wrong with epoch seconds. But it only takes one misplaced call to Time.now to create a failure. It's really best to just forget about Time.now. Search your code base for it and eliminate it. Always use Time.zone.now. Same thing for Date.today. That's time zone dependent. You want to use Time.zone.today. Unsurprisingly, I found a bunch of this class of failure when I was at RubyConf in Miami. So these methods create random failures, because your database objects can be in a different time zone than your machine's local time zone. External dependencies. Any time you depend on a third party service in your tests, you introduce a possible random element, right. S3, Google Analytics, Facebook. Any of these things can go down. They can be slow. They can be broken. Additionally, they all depend on the quality of your local internet connection. So, I'm gonna suggest that if you are affected by random failures, it's important to reproduce the failure. It is possible. It is not only possible, it is critical. And any problem that you can reproduce reliably can be solved. Well, at least, if you can reproduce it, you have a heck of a lot better chance of solving it. So, you have to bolt it all down. How do you fix the data? When you're trying to reproduce a random failure, you're gonna need the same database objects used by the failing test. So if you used factories, and there's no file system record of them when a test starts to fail randomly, you're gonna want to document the database state at the time of failure. And that's gonna mean YAML fixtures, or an SQL dump, or something else clever. You have to find a way to re-establish the same state that existed at the moment you had the failure. And the network. There was a great talk earlier about how to nail down the network. API calls and responses are input for your code. WebMock, VCR, and other libraries exist to replay third party service responses. So, if you're trying to reproduce a failure in a test that has any third party dependencies, you're gonna wanna use a library to capture and replay those responses. Also, share buttons, right. In your Cucumber tests, you're gonna wanna remove the calls to Google Analytics, Facebook like buttons, all that stuff from the browser. These slow down your page load time, and they create unnecessary failures because of that.
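For the capture-and-replay piece, a minimal VCR-plus-WebMock setup looks something like this. The file path and cassette directory are assumptions for the sketch, not anything from the talk.

```ruby
# spec/support/vcr.rb — a minimal sketch: record third-party HTTP responses
# once, then replay them so external flakiness can't randomize the build.
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'spec/cassettes'
  c.hook_into :webmock
  c.configure_rspec_metadata!   # tag examples with :vcr to wrap them in a cassette
  c.ignore_localhost = true     # leave Capybara's own traffic alone
end
```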
But, if you're replaying all your network calls, how do you know the external API hasn't changed, right? You want to test the services that your code depends on, too. So you need a build that does that. But it shouldn't be the main build. The purpose of the main build is to let the team know when their code is broken. And it should do that as quickly as possible. And then we have a separate, external build that tests the interactions with third party services. So, essentially, external communication is off in one and on in the other, and we check build results for both. So, I want to talk about another reason that tests fail randomly. RSpec runs all your tests in a random order every time. And obviously this introduces randomness. But there is a reason for that, and the reason is to help you stay on top of test pollution. Test pollution is when state that is changed in one test persists and influences the results of other tests. Changed state can live in process memory, in a database, on the file system, in an external service. Right. Lots of places. Sometimes the polluted state causes a subsequent test to fail incorrectly. And sometimes it causes a subsequent test to pass incorrectly. And this was such a rampant issue in the early days of RSpec that the RSpec team made running the tests in a random order the default as of RSpec 2. So, thank you, RSpec. Now, any test pollution issues should stand out. But what do you think happens if you ignore random test failures for, like, a year? Yeah. Here are some clues that your issue might be test pollution, right. With test pollution, the affected tests never fail when they're run in isolation. Not ever. And rather than throwing an unexpected exception, a test pollution failure usually takes the form of returning different data than what you expected. And finally, the biggest clue that you might have a test pollution issue is that you haven't really been checking for test pollution. So, we gotta reproduce test pollution issues. Which means we have to run the test suite in the same order, and we have to use the fixture or database data and the network data from the failed build. So, first you have to identify the random seed. Maybe you've seen this cryptic line at the end of your RSpec output. This is not completely meaningless. 22164 is your magic key to rerun the tests in the same order as the build that just ran. So you want to modify your .rspec file to include the seed value. Be sure to change the format to documentation as well as adding the seed. That will make the output more readable for you, so that you can start to think about the order that things are running in and what could possibly be causing your pollution problem.
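Concretely, that .rspec tweak is just two lines, with the seed value copied out of the failed build's output (22164 is the example from the slide):

```
--seed 22164
--format documentation
```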
So, test pollution is fundamentally about incorrectly persisted state, which means the state that's being persisted is what matters. You want to ensure that the data is identical to the failed build. And there's lots of ways to do this. So you've got your random seed. You've got your data from the failed build, and then you rerun the specs. And if you see the failure repeated, you should celebrate, right. You've correctly diagnosed that the issue is test pollution, and you are on your way to fixing it. And if you don't see the failure, maybe it's not test pollution. Maybe there's another aspect of your build environment that needs to be duplicated, right. But even then, say you've reproduced the problem. Now what? You still have to diagnose what is causing the pollution. You know that running the tests in a particular order creates a failure. The problem with test pollution is that there is a non-obvious connection between where the problem appears, in the failing test, and its source, in another test case. And you can find out about the failure using print statements or a debugger, using whatever tools you want. But maybe you get lucky and you are able to just figure it out. But in a complex code base with thousands of tests, the source of the pollution can be tricky to track down. So, just running through the suite to reproduce the failure might take ten minutes. And this is actually terrible, right. Waiting ten minutes for feedback? This is a source of cognitive depletion. All of the stack you've built up in your brain to solve this problem is disintegrating over those ten minutes. You're gonna work on other problems. You're gonna check Facebook while those tests are running. And you're gonna lose your focus, right. And that is, essentially, how rando wins. Fortunately, we can discard large amounts of complexity and noise by using a stupid process that we don't have to think about: binary search. In code, debugging via binary search is a process of repeatedly dividing the search space in half until you locate the smallest coherent unit that exhibits the desired behavior. OK. So we have the output of a set of specs that we ran in documentation mode. This is sort of a high level overview that you might see in Sublime, right. And in the middle here, this red spot is where the failure occurs. So we know the cause has to happen before the failure, because causality. So the green block at the top, that's the candidate block, or the search space. So, practically, we split the search space in half and remove half of it. And if the failure reoccurs when we rerun with this configuration, we know that the cause is in the remaining block, right. But sometimes you've got more problems than you know. So it's good to test the other half of the search space as well. So if your failure appeared in step zero, you expect not to see the failure here. If you also see the failure here, you might have multiple sources of test pollution, or, more likely, test pollution isn't really your problem, and the problem is actually outside of the search space. So here's a hiccup. Binary search requires us to remove large segments of the test suite to narrow in on the test that causes the pollution. And this creates a problem, because random ordering in the test suite changes when you remove tests. Completely. Remove one test, and the whole thing reshuffles on the same seed. So there's no way to effectively perform a binary search using a random seed. So here's the good news. It is possible to manually declare the ordering of your RSpec tests, using an undocumented configuration option, order_examples. So, config.order_examples takes a block, and that block gets the whole collection of RSpec examples after RSpec has loaded the specs to be run. And then you just reorder the examples in whatever order you want them run in and return that set from the block. So, that sounds simple. I made a little proto-gem for this. It's called rspec_manual_order, and basically it takes the output of the documentation format from the run you did earlier and turns that into an ordering list. So, if you log the output of your RSpec suite with the failure to a file, you'll be able to replay it using rspec_manual_order, and you can check that out on GitHub.
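Based on that description, a hand-rolled version of the manual ordering might look roughly like this. order_examples is the undocumented hook he names; the spec_order.log file and the matching-by-description approach are my assumptions, not his gem's actual implementation.

```ruby
# A sketch: replay a previously logged documentation-format ordering.
# Assumes spec_order.log holds full example descriptions, one per line.
ordered = File.readlines('spec_order.log').map(&:strip)

RSpec.configure do |config|
  config.order_examples do |examples|
    # Sort the loaded examples to match the logged order; unknown examples go last.
    examples.sort_by { |ex| ordered.index(ex.full_description) || ordered.size }
  end
end
```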
So it's possible to reduce the search space and do a binary search in RSpec. And once you've reduced the search space to a single spec, or a small set of examples that all cause the problem, you put your monkey brain in a position to shine against your test pollution issue, right. This is where it actually becomes possible to figure it out by looking at the context. I've gone in depth into test pollution because it's amenable to investigation using simple techniques, right. Binary search and reproducing the failure state are key debugging skills that you will improve with practice. When I started looking into our random failures, I didn't know we had test pollution issues. It turned out we weren't resetting the global time zone correctly between tests. This was far from the only problem I found, but without fixing this one, our suite would never be clean. So, every random failure that you are chasing has its own unique story. There are some in our code that we haven't figured out yet, and there are some in your code that I hope I never see. The key to eliminating random test failures is: don't give up, right. Today we've covered things that go wrong in Cucumber and Capybara, things that go wrong in RSpec, and just general sources of randomness in your test suite. And hopefully you're walking out of here with at least one new technique to improve the reliability of your tests. We've been working on ours for about eight months, and we're in a place where random failures occur, like, less than five percent of the time. And we set up a tiered build system to run the tests sequentially when the fast parallel build fails. So, the important thing is that when new random failures occur, we reliably assign a team to hunt them down. And if you keep working on your build, eventually you'll figure out a combination of tactics that will lead to a stable, reliable test suite that has the trust of your team. So thank you.