Solving Long, Flaky Tests with Azure DevOps

Tsuyoshi Ushio
Mar 3, 2019

I contribute to an open source project. The project is awesome, but we have a testing problem. The integration tests take a long time to run, and some of them are flaky. People gradually stopped running the tests because they were long and flaky. I decided to solve this problem, and the configuration turned out to be simpler than I expected. I’d like to share what I learned.

Multi agent build

Enabling Storage Emulator

We need to enable the Storage Emulator on the Hosted 2017 agent. Simply create a Command Line task and run this script inline.

sqllocaldb create MSSQLLocalDB
sqllocaldb start MSSQLLocalDB
sqllocaldb info MSSQLLocalDB
"C:\Program Files (x86)\Microsoft SDKs\Azure\Storage Emulator\AzureStorageEmulator.exe" start

Configure Environment Variables

Azure Pipelines lets you configure variables. However, they are NOT environment variables. Let’s turn them into environment variables. The variables are available inside a task, so create a PowerShell task that runs the following command to expose them as environment variables.

[Environment]::SetEnvironmentVariable("DurableTaskTestStorageConnectionString", "$(DurableTaskTestStorageConnectionString)")

Then set the variables in the pipeline.

Pipeline variables.

CA0068 Error

We encountered this error. It happens because FxCop can’t find the .pdb file.

##[error]CA0068 : Debug information could not be found for target assembly 'DurableTask.Core.dll'. For best analysis results, include the .pdb file with debug information for 'DurableTask.Core.dll' in the same directory as the target assembly.

When we searched the project, we found the setting in a props file. For now, we simply skip the FxCop code analysis by setting Configuration to Debug.

<!-- Code Analysis Settings -->
<RunCodeAnalysis>True</RunCodeAnalysis>
<RunCodeAnalysis Condition=" '$(Configuration)' == 'Debug' ">False</RunCodeAnalysis>

You can stop this error by adding `/p:DebugType=pdbonly` to your Visual Studio Build task. However, that eventually requires signing the assembly with a strong name. That will take time, so for now I keep using Debug in this CI.

Update the Test Adapter

We encountered this error.

An exception occurred while invoking executor 'executor://mstestadapter/v2': Object '/0f369e97_078e_49a1_8dab_5e11d1ca83d6/fp+dq7nedbbafpjqlevc08sn_19.rem' has been disconnected or does not exist at the server.

According to the issue, we can solve this by upgrading the NuGet packages MSTest.TestAdapter and MSTest.TestFramework to at least 1.2.0 in your test project’s csproj file.

<PackageReference Include="MSTest.TestAdapter" Version="1.4.0" />
<PackageReference Include="MSTest.TestFramework" Version="1.4.0" />

The first try

Unfortunately, the task was canceled because it ran for more than 30 minutes. By default, a job on a Hosted agent is canceled after 30 minutes. If you upgrade the plan, you can run it for up to 6 hours. I switched to a Self-Hosted agent running on my machine to test the pipeline.

The Hosted agent cancels the task after 30 minutes by default

Solving error on Self-Hosted agent

Once I changed the agent, I got this error. I hadn’t seen it on the Hosted agent.

Incorrect format for TestCaseFilter Missing Operator '|' or '&'. Specify the correct format and try again. Note that the incorrect format can lead to no test getting executed.

This error happens on the Self-Hosted agent. To prevent it, we can configure the batch size. You need to set a number large enough to cover the number of test cases.

But why? The reason is that the private agent uses the VsTest.Console.exe command, and the command doesn’t parse the test filter correctly. If you enable batching, Azure DevOps starts to use the API instead of the command, and the API parses the filter successfully. This is just a workaround.

Rerun option for flaky testing

Solving the flaky tests is quite easy. Just configure the rerun option on the Test task.

In our case, these are scenario tests for complex concurrency. They can fail by chance, but if we retry several times, they pass.
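The idea behind a rerun option can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Azure DevOps Test task is implemented. The names `run_with_rerun` and `make_flaky` and the outcome strings are hypothetical, and each test is assumed to be a function that returns True when it passes.

```python
from typing import Callable, Dict


def run_with_rerun(tests: Dict[str, Callable[[], bool]],
                   max_attempts: int = 3) -> Dict[str, str]:
    """Run each test, rerunning a failing test up to max_attempts times in total.

    Returns one outcome per test: "passed", "passed on rerun", or "failed".
    """
    outcomes = {}
    for name, test in tests.items():
        outcome = "failed"
        for attempt in range(1, max_attempts + 1):
            if test():
                # A test that needed more than one attempt is a flaky candidate.
                outcome = "passed" if attempt == 1 else "passed on rerun"
                break
        outcomes[name] = outcome
    return outcomes


def make_flaky(failures_before_pass: int) -> Callable[[], bool]:
    """Hypothetical test that fails a fixed number of times, then passes."""
    state = {"calls": 0}

    def test() -> bool:
        state["calls"] += 1
        return state["calls"] > failures_before_pass

    return test


if __name__ == "__main__":
    tests = {
        "stable": make_flaky(0),   # passes on the first attempt
        "flaky": make_flaky(1),    # fails once, then passes
        "broken": make_flaky(10),  # never passes within 3 attempts
    }
    for name, outcome in run_with_rerun(tests, max_attempts=3).items():
        print(f"{name}: {outcome}")
```

Keeping "passed on rerun" separate from "passed" is what makes the flaky tests visible afterwards, instead of hiding them behind a green build.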

You can find the flaky tests in the test results by using the "Passed on rerun" checkbox.

Flaky test detection

Long running test issue

Now our pipeline works! However, it takes too long. The testing alone took more than 22 minutes, and that was on my PC, a Surface Book 2, which is a very powerful dedicated machine. The Hosted agent is a Standard_DS2_v2. How can we improve the execution time?

Works! But it takes a long time

Parallel testing with multiple agents

Azure DevOps has a feature for parallel testing with multiple agents. Let’s configure that.

Multi agent configuration

Yes! It works! You can see three agents working. This feature starts several agents at the same time and runs the tests on them simultaneously, with the tests split across the three agents. For a public Azure DevOps repo, we can use up to 10 agents. For a private one, you can buy additional agents.

Three agents working
3 agents

Works! However, 22m (private PC) -> 15m (3 Hosted agents) is not a big deal. Hmm. Let’s try 5 agents and add more tests.

5 agents with a few more tests.

Hmm. Not a big difference.

Change the algorithm

When I observed the execution, I noticed one thing. The agents work properly. However, 4 of the agents had already finished while only one agent kept working. The execution time of each test is different! Deviation!

I noticed that we can change the balancing algorithm. The default algorithm simply splits the test cases equally. The "Based on past running time of tests" algorithm decides the test allocation based on the past execution time of the tests. How clever is that!
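To see why the allocation matters, here is a minimal Python sketch comparing the two strategies. It is not the actual Azure DevOps algorithm, whose details I don't know. As an assumption, the time-based split uses a simple greedy heuristic (longest test first, always onto the least-loaded agent), and the test names and durations are made up for illustration.

```python
from typing import Dict, List


def split_by_count(durations: Dict[str, float], agents: int) -> List[List[str]]:
    """Give each agent roughly the same number of tests, ignoring duration."""
    names = list(durations)
    return [names[i::agents] for i in range(agents)]


def split_by_past_time(durations: Dict[str, float], agents: int) -> List[List[str]]:
    """Greedy heuristic (an assumption, not the real implementation):
    take the longest tests first and give each to the least-loaded agent."""
    slices: List[List[str]] = [[] for _ in range(agents)]
    loads = [0.0] * agents
    for name in sorted(durations, key=durations.get, reverse=True):
        target = loads.index(min(loads))  # agent with the least work so far
        slices[target].append(name)
        loads[target] += durations[name]
    return slices


def wall_clock(slices: List[List[str]], durations: Dict[str, float]) -> float:
    """The run finishes only when the slowest agent finishes."""
    return max(sum(durations[name] for name in s) for s in slices)


if __name__ == "__main__":
    # Hypothetical past running times in minutes: two slow tests, four fast ones.
    past = {"t1": 10.0, "t2": 1.0, "t3": 1.0, "t4": 9.0, "t5": 1.0, "t6": 1.0}
    print("by count:", wall_clock(split_by_count(past, 3), past))
    print("by past time:", wall_clock(split_by_past_time(past, 3), past))
```

With these made-up numbers, the equal-count split puts both slow tests on the same agent and the run takes 19 minutes, while the time-based split finishes in 10. That is the same effect as one agent still working after the other four have finished.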

Based on past running time of tests

Try 5 agents with the algorithm.

5 agents with new algorithm

6 minutes! Amazing!

Conclusion

Flaky tests and long-running tests are two big issues in CI. It is said that a proper CI run should take less than 10 minutes to give proper feedback. If it takes longer than that, people give up waiting for the tests to finish.

If I had built all of this by myself, it might have taken a very long time to achieve this. I was really surprised at how easy it was. I hope this helps.

Resources

Test in parallel

You can find the concept here, with samples for .NET, JavaScript, and Python.

For more details on VS Test (.NET)

Flaky test

A very good read about flaky tests. This is an advanced case study by Microsoft.
