Solving long, flaky tests with Azure DevOps
I contribute to an open source project. The project is awesome, but we have a testing problem. The integration tests take a long time to run, and some of them are flaky. People gradually stopped running the tests because they are slow and flaky. I decided to solve this problem. The configuration turned out to be simpler than I expected, so I’d like to share what I learned.
Enabling Storage Emulator
We need to enable the Storage Emulator on the Hosted VS2017 agent. Simply create a Command Line task and run the following script inline.
sqllocaldb create MSSQLLocalDB
sqllocaldb start MSSQLLocalDB
sqllocaldb info MSSQLLocalDB
"C:\Program Files (x86)\Microsoft SDKs\Azure\Storage Emulator\AzureStorageEmulator.exe" start
Configure Environment Variables
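Azure Pipelines lets you define variables. However, they are NOT necessarily visible to the tests as environment variables (secret variables, for example, are not mapped automatically). Let’s turn them into environment variables. Pipeline variables are available inside a task through the $(Name) syntax, so create a PowerShell task that runs the following command to expose them as environment variables.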
[Environment]::SetEnvironmentVariable("DurableTaskTestStorageConnectionString", "$(DurableTaskTestStorageConnectionString)")
Then set the variables on the pipeline.
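To double-check that the value really reaches the test process, a small guard like the one below can help. This is only a minimal sketch in Python, not the project's actual .NET test code: the helper name get_storage_connection_string is hypothetical, and only the variable name DurableTaskTestStorageConnectionString comes from the task above. It treats an empty value and a value that still looks like an unexpanded $(Name) macro as errors, since either can be a sign that the pipeline variable was never set.

```python
import os

# Variable name taken from the PowerShell task above.
ENV_NAME = "DurableTaskTestStorageConnectionString"


def get_storage_connection_string(environ=None):
    """Return the connection string from the environment, or raise a clear error.

    Hypothetical helper for illustration; `environ` can be injected for testing.
    """
    env = os.environ if environ is None else environ
    value = env.get(ENV_NAME, "").strip()
    if not value:
        raise RuntimeError(f"{ENV_NAME} is not set for this process")
    if value.startswith("$(") and value.endswith(")"):
        # The macro was passed through literally instead of being replaced.
        raise RuntimeError(f"{ENV_NAME} still contains an unexpanded macro: {value}")
    return value


if __name__ == "__main__":
    # "UseDevelopmentStorage=true" is the shortcut connection string for the local emulator.
    print(get_storage_connection_string({ENV_NAME: "UseDevelopmentStorage=true"}))
```

Failing early with a clear message is cheaper than waiting for an integration test to time out against a missing storage account.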
CA0068 Error
We encountered the following error. It happens because FxCop can’t find the .pdb file.
##[error]CA0068 : Debug information could not be found for target assembly 'DurableTask.Core.dll'. For best analysis results, include the .pdb file with debug information for 'DurableTask.Core.dll' in the same directory as the target assembly.
Searching the project, we found the relevant settings in a props file. For now, we simply skip the FxCop code analysis by building with the Debug configuration.
<!-- Code Analysis Settings -->
<RunCodeAnalysis>True</RunCodeAnalysis>
<RunCodeAnalysis Condition=" '$(Configuration)' == 'Debug' ">False</RunCodeAnalysis>
You can also avoid this error by adding `/p:DebugType=pdbonly` to your Visual Studio Build task. However, that eventually leads to another issue: the assembly has to be signed with a strong name. Sorting that out takes time, so for now I keep using the Debug configuration in this CI.
Update the Test Adapter
We encountered the following error.
An exception occurred while invoking executor 'executor://mstestadapter/v2': Object '/0f369e97_078e_49a1_8dab_5e11d1ca83d6/fp+dq7nedbbafpjqlevc08sn_19.rem' has been disconnected or does not exist at the server.
According to the issue, we can solve this by upgrading the NuGet packages MSTest.TestAdapter and MSTest.TestFramework to at least 1.2.0 in the test project’s csproj file.
<PackageReference Include="MSTest.TestAdapter" Version="1.4.0" />
<PackageReference Include="MSTest.TestFramework" Version="1.4.0" />
The first try
Unfortunately, the run was canceled because it took more than 30 minutes. By default, a job on a Hosted agent is canceled after 30 minutes. If you upgrade the plan, a job can run for up to 6 hours. I switched to a Self-Hosted agent running on my machine to test the pipeline.
Solving an error on the Self-Hosted agent
Once I changed the agent, I got the following error, which I had not seen on the Hosted agent.
Incorrect format for TestCaseFilter Missing Operator '|' or '&'. Specify the correct format and try again. Note that the incorrect format can lead to no test getting executed.
This error happens on the Self-Hosted agent. To prevent it, configure the batch size on the test task. Set it to a number large enough to cover all of your test cases.
But why? The reason is that the private agent uses the VsTest.Console.exe command, and the command doesn’t parse the test filter correctly. If you enable batching, Azure DevOps starts to use an API instead of the command, and the API parses the filter successfully. This is just a workaround.
Rerun option for flaky tests
Dealing with the flaky tests is quite easy. Just configure the rerun option on the test task.
In our case, these are scenario tests that exercise complex concurrency. They can fail by chance, but they pass if we retry a few times.
You can find the flaky tests in the test results by using the Passed on rerun checkbox.
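The idea behind the rerun option can be sketched in a few lines of Python. This is an illustration under my own assumptions, not how the Azure DevOps test task is implemented: run_with_rerun and fails_first are hypothetical names, and a test is modeled as a function that returns True when it passes. A test that passes only after a retry is labeled "passed on rerun", which is exactly the group worth reviewing later.

```python
from typing import Callable, Dict


def run_with_rerun(tests: Dict[str, Callable[[], bool]], max_attempts: int = 3) -> Dict[str, str]:
    """Run each test up to max_attempts times and classify the outcome."""
    outcomes = {}
    for name, test in tests.items():
        outcome = "failed"
        for attempt in range(1, max_attempts + 1):
            if test():
                # A pass on the first attempt is a normal pass; a later pass marks the test as flaky.
                outcome = "passed" if attempt == 1 else "passed on rerun"
                break
        outcomes[name] = outcome
    return outcomes


def fails_first(n: int) -> Callable[[], bool]:
    """Build a deterministic stand-in for a flaky test that fails its first n calls."""
    calls = {"count": 0}

    def test() -> bool:
        calls["count"] += 1
        return calls["count"] > n

    return test


if __name__ == "__main__":
    results = run_with_rerun(
        {"stable": fails_first(0), "flaky": fails_first(2), "broken": fails_first(10)},
        max_attempts=3,
    )
    for name, outcome in results.items():
        print(f"{name}: {outcome}")
```

Reruns hide the symptom rather than fix the cause, so the "passed on rerun" list is a to-do list, not a success report.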
Long running test issue
Now our pipeline works! However, it takes too long. The tests alone took more than 22 minutes, and that was on my PC, a Surface Book 2, which is a very powerful dedicated machine. The Hosted agent is a Standard_DS2_V2. How can we improve the execution time?
Parallel testing with multiple agents
Azure DevOps can run tests in parallel on multiple agents. Let’s configure that.
Yes! It works! You can see three agents working. This feature starts several agents at the same time and runs them in parallel, with the tests split across the three agents. For a public Azure DevOps project, we can use up to 10 parallel agents. For a private one, you can buy additional agents.
It works! However, going from 22m (private PC) to 15m (3 Hosted agents) is not a big improvement. Hmm. Let’s try 5 agents, and add more tests.
Hmm. Not a big difference.
Change the algorithm
Watching the execution, I noticed one thing. The agents work properly, but four of them had already finished while only one kept on working. The execution time differs a lot from test to test, so the load was unevenly distributed!
I noticed that we can change the balancing algorithm. The default algorithm simply splits the test cases equally by count. The "Based on past running time of tests" algorithm decides the test allocation based on the past execution time of the tests. How clever!
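To see why the allocation matters, here is a minimal Python sketch comparing an equal-count split with a split based on past durations. It is not Azure DevOps' actual implementation, which I have not seen: the time-based version uses a simple greedy heuristic (longest test first, always onto the least-loaded agent), and the test names and durations are made up for illustration.

```python
import heapq
from typing import Dict, List


def split_by_count(durations: Dict[str, float], agents: int) -> List[List[str]]:
    """Give every agent the same number of tests, ignoring how long each one takes."""
    names = list(durations)
    return [names[i::agents] for i in range(agents)]


def split_by_past_time(durations: Dict[str, float], agents: int) -> List[List[str]]:
    """Greedy split: assign the longest remaining test to the least-loaded agent."""
    heap = [(0.0, i) for i in range(agents)]  # (accumulated seconds, agent index)
    buckets: List[List[str]] = [[] for _ in range(agents)]
    for name in sorted(durations, key=durations.get, reverse=True):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + durations[name], idx))
    return buckets


def makespan(buckets: List[List[str]], durations: Dict[str, float]) -> float:
    """Wall-clock time of the run, which is the time of the slowest agent."""
    return max(sum(durations[n] for n in bucket) for bucket in buckets)


if __name__ == "__main__":
    # Hypothetical past durations in seconds.
    past = {"slow_a": 300, "fast_a": 10, "slow_b": 280, "fast_b": 10, "slow_c": 260, "fast_c": 10}
    print("equal count:", makespan(split_by_count(past, 2), past), "seconds")
    print("past time:  ", makespan(split_by_past_time(past, 2), past), "seconds")
```

With equal counts, one agent can end up with all the slow tests while the other sits idle, which matches what I saw with four agents finished and one still working.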
Let’s try 5 agents with this algorithm.
6 minutes! Amazing!
Conclusion
Flaky tests and long-running tests are two big issues in CI. It is often said that a proper CI run should finish in less than 10 minutes to give timely feedback. If it takes longer than that, people give up waiting for the tests to finish.
If I had built this pipeline by myself, it might have taken a very long time to achieve this. I was really surprised at how easy it is. I hope this helps.
Resources
Test in parallel
Explains the concept, with samples for .NET, JavaScript, and Python.
For more details on VS Test (.NET)
Flaky test
A very good read about flaky tests. This is an advanced case study by MSFT.