Solving a memory leak issue in KEDA v2 with the power of the OSS community

KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events that need to be processed. KEDA supports a lot of scalers, with cloud-agnostic choices, and it is built by many community contributors. I contributed to the development of the ScaledJob v2 controller, a scale controller that lets us execute Jobs. There are mainly two kinds of applications on Kubernetes: one is a Deployment and the other is a Job. A Job receives a message and processes it; once the processing has finished, it completes its execution.

Blocking issue: a memory leak, probably in my code

Thanks to the passionate contributions of the community members, KEDA is about to release the V2 GA version! However, one nasty bug was reported: `Scaled Jobs memory leak`. It was probably MY FAULT! My code was blocking the GA release. It was very awkward for me. Let's solve this!

Memory profiling for Go

I researched memory profiling for Go. Go has a good profiler. For more details, I wrote a blog post on how to enable it for KEDA.
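As a rough illustration of what a Go heap profile is, the sketch below captures one in memory with the standard `runtime/pprof` package. It is a minimal example and not necessarily how the profiler was wired into KEDA; the function name `heapProfile` is made up for this post.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapProfile (hypothetical helper) returns a heap profile snapshot
// in the binary format that `go tool pprof` reads.
func heapProfile() ([]byte, error) {
	// Run a GC first so the profile reflects up-to-date statistics.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfile()
	if err != nil {
		fmt.Println("could not capture heap profile:", err)
		return
	}
	fmt.Printf("captured %d bytes of heap profile\n", len(data))
}
```

In a real controller you would write these bytes to a file on a mounted volume, or expose the profiler over HTTP, so that the profile can be inspected later with `go tool pprof`.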

Struggling to find the root cause

I thought that once I had a memory profiler, things would not be difficult. However, I spent a couple of days on it. With a memory leak, we need to run the application for a long time to detect the change. The other challenge is the platform. KEDA supports a lot of cloud services, and it is impossible for one person to cover all of them.

In this case, the report said that it happens with minikube and the GCP PubSub scaler. I have a GCP account, but it is unpleasant to run load tests against my personal account. It happens with ScaledJob, but it was not known whether it also happens with ScaledObject, which is the HPA-based scaling.

Wrong Direction

However, the memory consumption seemed to accumulate over time, so the cause was probably the ScaledJob controller logic. I therefore decided to go with Azure Service Bus on Azure Kubernetes Service. At one point I thought, “I can reproduce it!”, but it was a false positive. When I ran it all night, memory did increase, but my scaler configuration and receiver design were wrong. My sample used three queues, while the receiver consumed messages from only one of them. That is why KEDA kept creating Jobs that could never consume the messages, and under those circumstances memory grows. I compared the profiling map of a 30-minute run with that of an 8-hour run and found some objects whose memory was increasing. I read the code and identified the spot, but it was the place that stores Kubernetes objects as part of the reconcile loop. Once I configured everything properly, I could not reproduce the memory leak.

I ran it for a long time with the correct configuration to detect the leak, and the memory did not increase that much. If you look at the memory analysis, you cannot find any very large memory consumer in it.
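The idea of comparing a short run with a long run can be sketched in a few lines of Go: measure the live heap, run the suspected code path many times, and measure again. This is only an illustration under simple assumptions; `liveHeap`, `growthAfter`, and `leakyStep` are hypothetical names, and a real investigation would compare full pprof profiles instead of a single number.

```go
package main

import (
	"fmt"
	"runtime"
)

// retained simulates a leak: everything appended here stays reachable.
var retained [][]byte

// leakyStep (hypothetical) allocates 1 MiB and keeps a reference to it.
func leakyStep() {
	retained = append(retained, make([]byte, 1<<20))
}

// liveHeap forces a GC and reports the bytes of heap objects still allocated.
func liveHeap() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// growthAfter runs step the given number of times and returns how much
// the live heap grew, in bytes. A value that keeps rising with steps
// hints at a leak; a flat value hints that the memory is being released.
func growthAfter(steps int, step func()) int64 {
	before := liveHeap()
	for i := 0; i < steps; i++ {
		step()
	}
	return int64(liveHeap()) - int64(before)
}

func main() {
	fmt.Println("growth after 8 leaky steps (bytes):", growthAfter(8, leakyStep))
	fmt.Println("growth after 8 harmless steps (bytes):", growthAfter(8, func() {}))
}
```

A single number like this cannot tell you where the memory goes, which is why the profiler map is still needed to narrow things down.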

Unfortunately, our Go expert was on vacation. I was devastated. I asked my manager Anirudh for help, and he suggested:

1) Ask the principal programmer on the team, Anatoli

2) Ask the person who reported the issue to reproduce it and give you an environment

I decided to ask one of the best programmers on my team for a general approach to solving a memory leak.

Anatoli’s Advice

He gave me three very important pieces of advice for solving a memory leak:

1) Reproduce the issue in exactly the same environment

2) Narrow it down using the profiler

3) Once you find the spot, disable it and compare to see whether that solves the issue

I had used Azure Service Bus and Azure Kubernetes Service to reproduce the issue, and I failed. Before guessing at anything, I should have reproduced it first and then narrowed it down.

There are several possible reasons why I could not reproduce the issue. So before thinking any further, I should first reproduce it in exactly the same environment. Following the Shu-Ha-Ri principle, I decided to follow this advice as given. However, I do not know much about GCP PubSub. It might take time to learn it, configure it, and develop the test tool, and I might miss the deadline.

I hesitated a little because I am Japanese. In Japan, we never ask users to help us reproduce an issue in this way; we have no such custom. However, this is the U.S., and things might be a little different. I also remembered that this is not our own project. It is an open source project, and I am just one of its contributors. I trust Anirudh, and following the Shu-Ha-Ri principle, I decided to go with his suggestion.

Community members help me

I chatted on Slack with the community leader, Zbynek, and told him that I was struggling to reproduce the issue. He said we could try asking. We discussed it on Slack, and the reporter helped me!

When I was stuck for a couple of days under pressure (although actually no one was pushing me), I felt as if I had to solve this all by myself. However, my colleagues helped me, community members helped me, and even the issue reporter helped me! This is something I could not experience in Japan.

Reproducing in exactly the same environment

As Anatoli said, we need to reproduce the issue in exactly the same environment as the reporter. This time it really is “exactly the same”, and the issue already reproduces there. I created a KEDA binary with the profiler injected and built a Docker image from it. For more details on how to inject the profiler, you can refer to this blog post. In the directory where you cloned KEDA v2, you will find a Dockerfile. Build and push the image like this:

$ docker build -t tsuyoshiushio/kedav2:0.0.1 .
$ docker push tsuyoshiushio/kedav2:0.0.1

That's it. I asked him to modify the image part of the Helm chart and to mount a disk so that we could retrieve the profiler report from the container.

Borrowing Oren’s brain power

Actually, the reporter, Oren, is a very clever and great developer. He understood everything, structured the tests logically, and sent me several profiler reports. I was very impressed by his cleverness and kindness. This analysis is the profiler map of KEDA v2 with GCP PubSub.

As you can see, there is a big one. Compared with my profiling map, these objects are exceptionally big. On my map, there is no object larger than 1 MB.

And he said:

I don’t write Golang (yet), but it looks like GetSubscriptionSize creates a new client instance on each call?
At least in the python implementation, they recommend you re-use the same client.

I said, "That's absurd!" lol. How clever he is, even though he does not write Go!
According to the map, newBufWriter is a big consumer and its parent is newHTTP2Client. That makes sense. On my diagram, it was not nearly that big.

TBH, GCP PubSub was my blind spot. I was biased toward believing that the issue must be in my code. That is why I had ruled out this possibility. I should have thought more logically.

I read the code, including the internals. I could not find the exact reason why it leaks, but I remembered Anatoli's third piece of advice.

Once you find the spot, disable it and compare to see whether that solves the issue

So I assumed that the client instantiation was the suspicious spot, and I changed the code so that the client is created once and lives as long as the scaler that owns it.
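The difference between creating a client on every call and reusing one for the scaler's lifetime can be sketched as follows. This is not the actual KEDA or GCP client code: `fakeClient`, `leakyScaler`, and `reusingScaler` are hypothetical stand-ins, and the sketch assumes the leak comes from clients that are created on each poll and never closed.

```go
package main

import "fmt"

// fakeClient is a hypothetical stand-in for a cloud SDK client that
// holds connection buffers until Close is called.
type fakeClient struct {
	buf []byte // pretend connection buffers
}

// openClients counts clients that were created but not yet closed.
var openClients int

func newFakeClient() *fakeClient {
	openClients++
	return &fakeClient{buf: make([]byte, 32<<10)}
}

func (c *fakeClient) Close() { openClients-- }

// QueueLength pretends to ask the service how many messages are waiting.
func (c *fakeClient) QueueLength() int { return 0 }

// leakyScaler mimics the suspected pattern: a new client on every poll,
// and nobody ever closes it.
type leakyScaler struct{}

func (leakyScaler) GetQueueLength() int {
	return newFakeClient().QueueLength()
}

// reusingScaler creates the client once and keeps it for its lifetime.
type reusingScaler struct {
	client *fakeClient
}

func (s *reusingScaler) GetQueueLength() int {
	if s.client == nil {
		s.client = newFakeClient()
	}
	return s.client.QueueLength()
}

// Close releases the client when the scaler itself is no longer needed.
func (s *reusingScaler) Close() {
	if s.client != nil {
		s.client.Close()
		s.client = nil
	}
}

func main() {
	leaky := leakyScaler{}
	for i := 0; i < 100; i++ {
		leaky.GetQueueLength()
	}
	fmt.Println("open clients after 100 leaky polls:", openClients)

	openClients = 0
	reusing := &reusingScaler{}
	for i := 0; i < 100; i++ {
		reusing.GetQueueLength()
	}
	fmt.Println("open clients after 100 reusing polls:", openClients)
	reusing.Close()
}
```

Because a scaler is polled repeatedly, anything it allocates per poll and never releases adds up over hours, which would match a leak that only shows up in long runs.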

I built it with the profiler, pushed it to Docker Hub again, and asked Oren to test it again. After we chatted on Slack, he sent me a message.

Oh! Yes! I was very glad to hear that. I really appreciate him. He is the hero.

Conclusion

Modern software is complex. One person may not be able to solve everything. However, great community collaboration and contributors can solve a difficult issue in a very short time. I appreciate all the contributors. I also solved another severe issue in the same collaborative way. I believe this collaborative style of development will accelerate the world of software development even more. At the very least, I love the development community.
