Flaky Tests are Not Random Failures

David Stosik
Published in Source Diving
Nov 20, 2019 · 5 min read

This is the first part of a two-part series of blog posts on flaky tests. You’ll find the second part here: A Methodological Approach to Fixing Flaky Tests.

Automated testing and continuous integration are common practices nowadays. They are a foundational aspect of our work at Cookpad, helping us achieve fast development cycles and break things less often.

Automated testing, however, comes with its own quirks, one of which is flaky tests.

What is a flaky test?

A flaky test is a test that’s unreliable in behaviour, meaning that it yields different results inconsistently.

One moment it will pass (and probably did when it was merged into the codebase), and the next it will suddenly fail, perhaps only to pass again when re-run.

They are sometimes referred to as “random failures”, but in reality it’s often less about actual randomness than about very reproducible edge cases that surface in a seemingly random fashion.

Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite. As a result they need to be dealt with as soon as you can, before your entire deployment pipeline is compromised.

(Eradicating Non-Determinism in Tests — Why non-deterministic tests are a problem, Martin Fowler)

How can a test be flaky?

The majority of the time, a test’s flakiness is not due to randomness.

If conditions can be reproduced accurately, then the test will always behave the same. Below are a few common patterns to look for:

Leaking state

To be reliable, a test case needs to always run within the same set of conditions.

Often this is achieved via a teardown phase that will restore the environment to a pristine state.

When a test case fails to do a proper teardown, the environment may end up in a different state from before, which can impact the tests that follow.

If a test suite is running tests in a random order, then flaky failures may show up. However, running the same test suite in the same order (same seed) will always yield the same results.

Here is an example:
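The snippet below is a minimal sketch of the situation, assuming a hypothetical SessionHelper#logged_in? method backed by Current.user and a FactoryBot user factory:

RSpec.describe SessionHelper, type: :helper do
  describe "#logged_in?" do
    # Passes on its own: nothing has assigned Current.user, so it is nil.
    it "returns false if the user is logged out" do
      expect(helper.logged_in?).to eq(false)
    end

    # Assigns Current.user but never resets it, leaking state into later specs.
    it "returns true if the user is logged in" do
      Current.user = create(:user)
      expect(helper.logged_in?).to eq(true)
    end
  end
end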

If the two specs above run in the order they were defined, they will pass happily.

However, if they are run in the opposite order, then “returns false if the user is logged out” will fail.

The reason is that the second test’s state leaks out: Current.user does not get reset before the other test runs, so we still have a logged-in user.

One way to fix this is to make sure Current.user is properly reset after every spec:
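A sketch of such a reset hook, assuming Current is an ActiveSupport::CurrentAttributes subclass:

RSpec.configure do |config|
  config.after do
    # Clears Current.user (and any other Current attributes) after each example.
    Current.reset
  end
end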

Relying on non-deterministic systems

Sometimes, tests make assumptions about what a system will output, when that output is not as deterministic as imagined.

A common manifestation of this in flaky tests is ordering in databases. Take the following test:
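A minimal version of that test might look like this, assuming a :published trait on the recipe factory:

it "returns only published recipes" do
  recipe1 = create(:recipe, :published)
  recipe2 = create(:recipe, :published)
  create(:recipe) # unpublished recipe, should be filtered out

  expect(Recipe.published).to eq([recipe1, recipe2])
end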

This test’s goal is to make sure that Recipe.published only returns published recipes.

Depending on the implementation and the database system backing the application, it could be a flaky test: without an explicit ORDER BY, row ordering in SQL is not guaranteed, and a query such as SELECT * FROM recipes WHERE published_at IS NOT NULL will not necessarily return its results in the same order every time.

This test might usually pass, but occasionally fail because [recipe2, recipe1] is not equal to [recipe1, recipe2].

There are numerous approaches to fixing this flaky test, but they all fall into one of the two categories below:

The test is broken: if the test assumes a particular result order even though the application is not meant to guarantee any order, then the test needs to be relaxed. For example, the following expectation would ignore ordering:

expect(Recipe.published).to contain_exactly(recipe1, recipe2)

The app is not behaving as expected: if the application is indeed expected to return recipes in a given order, then this needs to be fixed, and tests need to be clarified. For example:
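One possibility, sketched below, is to make the scope’s ordering explicit (ordering by id here is only an illustration) and keep the strict expectation in the spec:

# app/models/recipe.rb: make the ordering explicit
scope :published, -> { where.not(published_at: nil).order(:id) }

# In the spec, the strict expectation is now legitimate:
expect(Recipe.published).to eq([recipe1, recipe2])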

Unintended “concurrency”

We deliberately didn’t sort by published_at in the previous example because this can be another source of flakiness.

Take the following test:
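A sketch of such a test, assuming the published scope now sorts by published_at and the :published factory trait stamps published_at with the current time:

it "returns published recipes, oldest first" do
  recipe1 = create(:recipe, :published)
  recipe2 = create(:recipe, :published)

  expect(Recipe.published).to eq([recipe1, recipe2])
end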

This will pass most of the time, as long as the database column’s DateTime precision is high enough and enough time passes between creating the two recipes that they end up with different published_at timestamps.

If, however, the column’s precision is low, or the system is fast enough that the two recipes are created at essentially the same time, we’re back to a non-deterministic system that cannot guarantee the order in which those recipes will be returned.

This can also creep in slowly as hardware becomes more powerful, and less time passes between the two publications.

To fix it, we can explicitly define the value of published_at for the two recipes:

recipe1 = create(:recipe, :published, published_at: 1.day.ago)
recipe2 = create(:recipe, :published, published_at: Time.current)

Alternatively, we can travel back in time before creating the first recipe (travel_to freezes the clock at that point), so that the two recipes end up with clearly distinct published_at values:

travel_to(1.day.ago)
recipe1 = create(:recipe, :published)
travel_back
recipe2 = create(:recipe, :published)

Time passing

The opposite kind of flaky test also exists: one that forgets to consider the passing of time. This kind of test assumes that two successive instructions happen at exactly the same moment:
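A sketch of such a test, assuming a hypothetical publish! method that sets published_at to Time.current:

it "stamps published_at with the current time" do
  recipe = create(:recipe)
  recipe.publish!

  # Flaky: real time keeps passing between publish! and this expectation,
  # so the two timestamps are not guaranteed to be equal.
  expect(recipe.published_at).to eq(Time.current)
end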

In this case, it is possible to use freeze_time so that Time.current returns the same value that was set when publishing the recipe:
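Reusing the same hypothetical publish! method:

it "stamps published_at with the current time" do
  freeze_time do
    recipe = create(:recipe)
    recipe.publish!

    # Time is frozen for the whole block, so Time.current here returns
    # exactly the value that publish! stored.
    expect(recipe.published_at).to eq(Time.current)
  end
end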

Async JavaScript behaviour

At Cookpad, we use Capybara and ChromeDriver to test the behaviour of features relying on JavaScript.

It is worth noting that since the browser runs in a different process, what happens in there is not synchronized with the test’s commands.

When, for example, we simulate a click on a link with click_on, the instruction returns and execution continues to the next line as soon as the click has been sent to the browser.

It doesn’t mean however that the browser is done processing the chain of events triggered by that click.

In the example below, visiting the bookmarks page could stop the JavaScript execution before it even got a chance to send the AJAX request adding the recipe to the bookmarks.
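A sketch of the scenario, with an illustrative button label and illustrative paths:

it "adds the recipe to the user's bookmarks" do
  visit recipe_path(recipe)
  click_on "Bookmark" # returns as soon as the click has been sent to the browser

  visit bookmarks_path # may interrupt the AJAX request triggered by the click
  expect(page).to have_text(recipe.title)
end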

The expectation might succeed or fail, depending on conditions we can’t control (e.g. browser speed).

To prevent this kind of race condition, we usually use Capybara’s ability to synchronize expectations. The have_text matcher, for example, will wait and retry until it succeeds or times out:
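Continuing the sketch above, with an illustrative confirmation message:

visit recipe_path(recipe)
click_on "Bookmark"
expect(page).to have_text("Recipe added to your bookmarks") # waits for the AJAX round-trip to complete

visit bookmarks_path
expect(page).to have_text(recipe.title)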

The additional expectation will make sure that a confirmation message shows on the page before continuing.

If it fails the first time, Capybara will sleep for 10ms, then retry, and will keep doing so until either the assertion succeeds or the total wait exceeds Capybara.default_max_wait_time.

When the confirmation message eventually shows on the page, the expectation will succeed and execution will continue to the next operation; at that point, we can be confident the recipe was added to the bookmarks.

Flaky != random

Even though flaky tests appear to happen randomly, they’re usually triggered by a very reproducible set of conditions.

The patterns listed above are only a few examples of what can cause a test to be flaky, but they all have one thing in common: a test is flaky because the conditions and the environment in which it runs fluctuate.

As soon as that source of instability is located and fixed, any impression of randomness disappears, and the test behaves consistently again.

Having a good idea of how some flaky tests can happen helps to avoid them, but won’t completely prevent them from appearing. In the second part of my write-up, I’ll focus on a methodological approach to fixing flaky tests.
