
Why are my tests unstable?

You made changes in your application, ensured unit test coverage, reviewed, and merged into your main branch. But suddenly, and seemingly randomly, the new unit tests fail. This article dives into possible reasons why tests may work perfectly when run alone in isolation but fail when run as part of a larger suite.

Order of Execution

As in mathematics, the order of operations can be critical when running unit tests. Various assumptions about dependencies and prerequisites, as well as the state of data, must hold. For example, in an order tracking application, test A validates that orders are sorted by location, and test B checks that order clean-up (deletion of orders) works correctly. If test B runs before test A, test A may fail because there is nothing left to sort.

Ideally, we should build our tests so that they do not rely on other tests and are idempotent. Tests A and B should be independent of one another. In practice, however, this kind of coupling is common, because leaning on another test's side effects often simplifies the setup. It is also a risk, because some testing frameworks may execute tests in parallel or in random order.

Therefore, we should always keep in mind that the order of execution of our unit tests may change. That can even be a good thing, because it exercises the test code itself. And, even when inter-dependencies are unavoidable, our unit tests should declare them explicitly. Test frameworks support this in various ways, for example, the dependsOnMethods attribute in TestNG, or the @TestMethodOrder and @Order annotations in JUnit 5, as in the sketch below.
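
For illustration, here is a minimal JUnit 5 sketch (the class and test method names are hypothetical) that makes the ordering between the two example tests explicit:

import org.junit.jupiter.api.MethodOrderer.OrderAnnotation;
import org.junit.jupiter.api.Order;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.TestMethodOrder;

//forces a deterministic execution order so the dependency is visible and stable
@TestMethodOrder(OrderAnnotation.class)
class OrderTrackingTest {

    @Test
    @Order(1)
    void ordersAreSortedByLocation() {
        //test A: relies on orders still being present
    }

    @Test
    @Order(2)
    void orderCleanUpDeletesOrders() {
        //test B: removes orders, so it must run after the sorting test
    }
}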

Global settings

Unit tests may behave differently based on global settings or configuration, especially if the test code mutates these global values. Just as with the order of execution above, these global values are a dependency that can be affected by other tests, changing the expected data or workflow.

Let's look at an example:

@Test
public void testAnonymizeUser1(){
    setGlobalValue(Globals.MASK_USER_DATA, true);
    User user = getUser(1);
    assertFalse(user.getPrintableName().contains("John"));
    assertTrue(user.getPrintableName().contains("****"));
}

In the simple unit test above, we activate a global setting (this may be an environment variable, feature flag, static variable, etc.) that masks sensitive data. Our test works as expected, and everything is masked correctly. But there is a problem lurking…

We forgot to reset the global data state, and other tests which expect the global data to be unmasked will fail. In this case, the best practice is to reset any global/shared data element back to its default or previous state. You should do this in a finally {} block to avoid the risk of skipping the reset if some assertion fails or another exception occurs.

@Test
public void testAnonymizeUser2() {
    boolean isDataMasked = getGlobalValue(Globals.MASK_USER_DATA);
    try {
        setGlobalValue(Globals.MASK_USER_DATA, true);
        User user = getUser(1);

        assertFalse(user.getPrintableName().contains("John"));
        assertTrue(user.getPrintableName().contains("****"));
        //don't reset here: if an assertion above fails, the lines after it are never reached
    } finally {
        //even if test fails, we will reset variable to previous state
        setGlobalValue(Globals.MASK_USER_DATA, isDataMasked);
    }
}

As a good alternative, test frameworks offer support for running set-up methods before and clean-up methods after a set of unit tests. This facilitates orchestrating dependencies and sharing them across multiple tests at the method, class, or test suite level (e.g., JUnit 4 provides @Before/@After, and JUnit 5 provides @BeforeEach/@AfterEach).
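
As a minimal sketch of that approach (reusing the hypothetical Globals helpers from the examples above), the save-and-restore logic can be moved into JUnit 5 lifecycle methods so that every test in the class starts from a known state:

private boolean previousMaskSetting;

@BeforeEach
public void rememberGlobalState() {
    //remember the current value, then switch masking on for the tests in this class
    previousMaskSetting = getGlobalValue(Globals.MASK_USER_DATA);
    setGlobalValue(Globals.MASK_USER_DATA, true);
}

@AfterEach
public void restoreGlobalState() {
    //runs even when a test fails, so the global state never leaks into other tests
    setGlobalValue(Globals.MASK_USER_DATA, previousMaskSetting);
}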

Timing

Another reason unit tests fail occasionally is broken assumptions about timing or the time of day. For example, your code changes and tests pass locally, but the next day the report from the nightly run says they fail. Consider this logic:

public void processNewOrder(Order order) {
    if(isWorkingHours()){
        send(order);
        archive(order);
    } else {
        waitForNextWorkingDay(order);
    }
}

The functionality we are testing above triggers different workflows at different times of day, and maybe even on different days of the week. These problems may surface on weekends, public holidays, at the beginning or end of a month, during daylight saving time changes, etc.

In this case, a good practice is to force the desired date and time as part of the test set-up, and then exercise every branch so that all workflows are covered.
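
A minimal sketch of that idea, assuming the production code accepts an injected java.time.Clock (OrderProcessor, Order, and hasPendingOrders are hypothetical names):

import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;

@Test
public void testOrderWaitsOutsideWorkingHours() {
    //freeze "now" at 02:00 UTC so the test always exercises the off-hours branch
    Clock nightTime = Clock.fixed(Instant.parse("2024-01-15T02:00:00Z"), ZoneOffset.UTC);
    OrderProcessor processor = new OrderProcessor(nightTime);

    processor.processNewOrder(new Order());

    //the order should be queued for the next working day rather than sent
    assertTrue(processor.hasPendingOrders());
}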

External Dependencies

Tests may sometimes fail when they depend on an external service that is unreliable or unavailable. This usually shows up in integration tests rather than unit tests. A good practice for covering an external dependency in a unit test is to provide a mock for the service, so there is no external dependency left to fail.
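
As a sketch, a Mockito-based test might look like the following; ExchangeRateService, OrderPricer, and their methods are hypothetical names used only to illustrate replacing the remote call:

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

@Test
public void testOrderTotalWithMockedExchangeRateService() {
    //the mock answers locally, so no network call can fail or slow the test down
    ExchangeRateService rates = mock(ExchangeRateService.class);
    when(rates.getRate("EUR", "USD")).thenReturn(1.10);

    OrderPricer pricer = new OrderPricer(rates);

    assertEquals(110.0, pricer.priceInUsd(100.0, "EUR"), 0.001);
}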

Concurrency Issues

Concurrency can also be an issue. Tests that use multiple threads or asynchronous processing can touch shared data in a parallel or undefined order, which may leave that data in an inconsistent state.

As a good practice, proper thread safety must be implemented anywhere multi-threading is used in the application or the tests. That thread safety should also be validated by unit tests that confirm the state stays consistent even under high load; a sketch of such a check follows.
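
One way to sketch such a check, assuming a hypothetical shared OrderCounter component, is to hammer the shared state from many threads and assert the final result:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

@Test
public void testCounterIsThreadSafe() throws InterruptedException {
    OrderCounter counter = new OrderCounter();
    int threads = 50;
    int incrementsPerThread = 1000;

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
        pool.submit(() -> {
            for (int j = 0; j < incrementsPerThread; j++) {
                counter.increment();
            }
            done.countDown();
        });
    }
    done.await(30, TimeUnit.SECONDS);
    pool.shutdown();

    //without proper synchronization this assertion fails only intermittently
    assertEquals(threads * incrementsPerThread, counter.value());
}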

Other causes

Less likely, but still worth mentioning, are the edge cases where there is a difference between the two systems that execute the tests. For example, there may be differences between running locally on your dev workstation and automated execution in the build pipeline.

There may be slight differences in OS versions, OS settings, environment variables, text encodings, test library versions, or language versions, any of which can lead to unexpected differences in outcome.
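
For example, a formatting assertion that implicitly depends on the JVM's default locale may pass on one machine and fail on another; pinning the locale removes the hidden dependency (a minimal illustrative sketch):

import java.util.Locale;

@Test
public void testPriceFormatting() {
    //implicitly depends on the default locale: a German-locale build agent produces "1.234,50"
    String formatted = String.format("%,.2f", 1234.5);
    assertEquals("1,234.50", formatted);

    //more robust: pin the locale explicitly so the result is the same on every system
    String pinned = String.format(Locale.US, "%,.2f", 1234.5);
    assertEquals("1,234.50", pinned);
}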

Troubleshooting & Best practices

In all of the above cases, and others not described here, determining the root cause of unstable test code can be tricky and sometimes complex. To identify it, it usually helps to gather data from many different runs: looking for patterns in behavior, comparing the logs, and comparing the inputs, outputs, and prerequisites.

In conclusion, the best advice we can give is "don't leave your unstable tests as-is, and don't lazily disable them." Investigate them immediately and follow up with fixes and any other technical debt clean-up you find. The time invested in resolving them will pay dividends in the future, because unreliable tests slow down the development process, bring noise and confusion into the team, and keep raising false alarms.

And ultimately, an unexpectedly failing test has accomplished its original goal: highlighting a weakness in the code base so that it can be made more resilient!
