Solving a Memory Leak issue on KEDA v2 with the power of the OSS community
KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events that need to be processed. KEDA supports a lot of scalers with cloud-agnostic choices, and it is built by many community contributors. I contribute to the development of the ScaledJob V2 controller, a scale controller that enables us to execute Jobs. We mainly have two kinds of applications on Kubernetes: one is a Deployment, and the other is a Job. A Job receives a message, processes it, and completes its execution once the processing has finished.
Blocking issue: a Memory Leak, probably in my code
Thanks to the passionate contributions of the community members, KEDA was about to release the V2 GA version! However, one nasty bug was reported: `Scaled Jobs memory leak`. It was probably MY FAULT! My code was blocking the path to GA. It was very awkward for me. Let's solve this!
Memory Profiling for Go
I researched memory profiling for Go. Go has a good profiler. For more details on how to enable it for KEDA, I wrote a blog post.
Struggling to find the root cause
I thought that once I had a memory profiler, things wouldn't be difficult. However, I spent a couple of days on it. For a memory leak issue, we need to run the application for a long time to detect the change. The other challenge is the platform: KEDA supports a lot of cloud services, and it is impossible for one person to cover all of them.
In this case, the report said that it happens on minikube with the GCP PubSub scaler. I have a GCP account; however, it is nasty to run load testing against my personal account. It happens with ScaledJob, but it was not known whether it also happens with ScaledObject, which means HPA-based scaling.
Wrong Direction
However, the memory consumption seemed to accumulate over time, so the ScaledJob controller logic looked like the likely culprit. I decided to go with Azure Service Bus and Azure Kubernetes Service. At one point I thought, "I can reproduce it!" However, it was a false positive. When I ran it all night, the memory did increase, but the design of my scaler configuration and receiver was wrong. My sample used three queues, while the receiver consumed messages from only one queue. That is why it kept creating Jobs that could never consume their messages. Under those circumstances, memory increases. I compared the profiling maps of a 30-minute run and an 8-hour run and found some objects increasing the memory. I read the code and identified the spot; however, it was the place that stores k8s objects as part of the reconcile loop. Once I configured it properly, I couldn't reproduce the memory leak.
When I ran it for a long time with the correct configuration, the memory did not increase that much. Looking at the memory analysis, you can't find any very big memory consumption in it.
Unfortunately, our Go expert was on vacation. I was devastated. I asked my manager Anirudh for help, and he suggested:
1) Ask the principal programmer on the team, Anatoli
2) Ask the person who reported the issue to help reproduce it and give you an environment
I decided to ask one of the best programmers on my team about the general approach to solving a memory leak.
Anatoli’s Advice
He advised me on some very important points for solving a memory leak:
Reproduce the issue with exactly the same environment
Narrow it down using the Profiler
Once you find the spot, disable it and compare whether that solves the issue
I used Azure Service Bus and Azure Kubernetes Service to try to reproduce the issue, but I failed. Before guessing anything, I should have reproduced it first, and only then narrowed it down.
There are several reasons why I couldn't reproduce the issue, so before theorizing, I should have reproduced it in exactly the same environment first. As the Shu-Ha-Ri principle says, I followed the principle. However, I don't know much about GCP PubSub. It might take time to learn it, configure it, and develop the test tool, and I might miss the deadline.
I hesitated a little, since I'm Japanese. In Japan, we would never ask users to help us reproduce an issue this way; we have no custom of doing so. However, here in the U.S., it might be a little different. I also reminded myself that this is not our project; it is an open source project, and I'm just a contributor. I trust Anirudh, and following the Shu-Ha-Ri principle, I decided to go with it.
Community members helped me
I chatted on Slack with the community leader Zbynek and told him that I was struggling to reproduce the issue. He said we could try to ask,
and we discussed it on Slack, and the reporter helped me!
When I was stuck for a couple of days under pressure (actually, no one pushed me, though), I felt that I needed to solve this all by myself. However, my colleagues helped me, and community members helped me, even the issue reporter! This is something I couldn't have experienced in Japan.
Reproduce in exactly the same environment
As Anatoli said, we needed to reproduce the issue in exactly the same environment as the reporter. This time, it really is "exactly the same": the reproduction was already done on his side. I created a KEDA binary with the profiler injected and built a Docker image. For more detail on how to inject the profiler, you can refer to this blog post. Then, in the directory of the cloned KEDA v2 repository, you can find a Dockerfile. Do it like this:
$ docker build -t tsuyoshiushio/kedav2:0.0.1 .
$ docker push tsuyoshiushio/kedav2:0.0.1
That's it. I asked him to modify the image part of the Helm chart, then mount a disk to get the report out of the container.
Borrowing Oren’s brain power
Actually, the reporter, Oren, is a very clever and great developer. He understood everything, logically structured the tests, and sent me several profiler reports. I'm very impressed with his cleverness and kindness. This analysis is the profiler map of KEDA v2 with GCP PubSub.
As you can see, there is a big one there. Compared with my profiling map, these objects are exceptionally big; on my map, there was no object over 1MB.
And he said
I don’t write Golang (yet), but it looks like GetSubscriptionSize creates a new client instance on each call?
At least in the python implementation, they recommend you re-use the same client.
I said, "That's absurd!" lol. How clever he is, even though he doesn't write Go!
According to the map, newBufWriter is the big one, and its parent is newHTTP2Client. That makes sense. Compared with my diagram, it was not that big.
To be honest, GCP PubSub was my blind spot. I was biased toward the issue being in my own code, which is why I had crossed out this possibility. I should have thought more logically.
I read the code, including the internals. I couldn't find the exact reason why it leaks; however, I remembered Anatoli's third piece of advice:
Once find the spot, disable it and compare if it solve the issue
So I assumed that the client instantiation was the suspicious spot, and I made the clients live for the scaler's lifetime instead.
I built it with the profiler, pushed it to DockerHub again, and asked Oren to test it once more. After chatting with him on Slack, he sent me a message.
Oh! Yes! I'm very glad to hear that. I really appreciate him. He is the hero.
Conclusion
Modern software is complex. It might not be something one person can solve entirely alone. However, great community collaboration and contributors can solve a difficult issue in a short time. I appreciate all the contributors. I also solved another severe issue in the same collaborative way. I believe this collaborative style of development will accelerate the world of software development even more. At least, I love the development community.