Distributed System Logging Best Practices (1) — Overview

Tsuyoshi Ushio
3 min read · Jan 12, 2023

I wanted to know the best practices for writing logs. However, I couldn’t find any. My ideas might not be mature, but I started writing down the best practices myself. Any feedback is welcome.

This blog series consists of three parts.

  1. Overview
  2. How to write logs?
  3. How to use logs?

I assume the audience is beginner to intermediate developers who work on distributed systems.

What is the purpose of logging?

Logging is essential because it lets you:

  1. Solve problems in production
  2. Observe, measure, and validate the behavior of your system
  3. Automate detecting, mitigating, or resolving issues in production

If you work on a distributed system, you will realize how important logging is. You would rather spend your time developing software, but imagine that you have an incident in production. If you work in the DevOps model, as I do, you need to look after your own system. A bunch of incidents eats up your coding time.

Why is logging important?

If the logs are poor, it is nearly impossible to find the root cause. This is especially true for a distributed system that consists of several microservices: without good logging, it is almost impossible to investigate an incident.

How long does each way of investigating an incident take? The list below orders them roughly from shortest to longest lead time.

Lead time differences
  1. Log search
  2. Code reading
  3. Browsing the database
  4. RDP/SSH
  5. Reproducing the issue
  6. Adding more logs and deploying them

Sometimes the order is different, but we want to avoid the costly options as much as we can. Solving an incident by log search takes the least time of all of them.

Logs are used not only for solving live site incidents but also for observing, measuring, and validating the behavior of the system. Imagine you are writing a Scale Controller that controls scaling behavior, and you want to test and validate that behavior. That can’t be achieved with unit/integration testing alone. By using logs, you can observe the system's behavior and how it scales.
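As a rough illustration of validating behavior from logs, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `ScaleDecision queue_length=... workers=...` log format, the function names, and the scaling rule (one worker per 100 queued messages, capped at a maximum). A real Scale Controller would have its own log format and rules.

```python
import re

# Assumed log line shape: "... ScaleDecision queue_length=<int> workers=<int>"
SCALE_LINE = re.compile(r"ScaleDecision queue_length=(\d+) workers=(\d+)")


def parse_scale_events(lines):
    """Extract (queue_length, workers) pairs from scale decision log lines."""
    events = []
    for line in lines:
        match = SCALE_LINE.search(line)
        if match:
            events.append((int(match.group(1)), int(match.group(2))))
    return events


def find_violations(events, max_workers, per_worker_capacity):
    """Return events where the worker count breaks the assumed scaling rule."""
    violations = []
    for queue_length, workers in events:
        # Assumed rule: enough workers for the queue, at least 1, capped at max_workers.
        needed = -(-queue_length // per_worker_capacity)  # ceiling division
        expected = max(1, min(max_workers, needed))
        if workers != expected:
            violations.append((queue_length, workers, expected))
    return violations


if __name__ == "__main__":
    logs = [
        "10:00:00 INFO ScaleDecision queue_length=50 workers=1",
        "10:00:30 INFO ScaleDecision queue_length=450 workers=5",
        "10:01:00 INFO ScaleDecision queue_length=2000 workers=12",
        "10:01:30 INFO request completed",
    ]
    events = parse_scale_events(logs)
    # Each violation is (queue_length, actual_workers, expected_workers).
    print(find_violations(events, max_workers=10, per_worker_capacity=100))
```

With these sample lines, only the third decision is reported, because 12 workers exceeds the assumed cap of 10. The point is that the check runs against what the system actually logged while scaling, which a unit test of a single function cannot show.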

Lastly, logs are useful for automation. Using a logging system, you can automate incident mitigation and resolution. If you spot a particular log entry that indicates a known incident, you can automate the mitigation/fix, for example by rebooting a VM or calling a REST API. If you do it manually, you always need to spend time on things like getting approval and using a special machine to access the production environment. It usually takes time. Automating that action saves your coding time.
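The idea of mapping a known log signature to an automated action could look something like the following Python sketch. The log messages, the VM names, and the `reboot_vm` and `restart_service` functions are all hypothetical. In a real system the actions would call your cloud provider or a REST API, and the matching would usually run as an alert rule inside the logging system rather than in a script.

```python
import re
from typing import Optional


# Hypothetical remediation actions; real ones would call a cloud or REST API.
def reboot_vm(vm: str) -> str:
    return f"reboot requested for {vm}"


def restart_service(vm: str) -> str:
    return f"service restart requested on {vm}"


# Known incident signatures (made up for illustration) mapped to actions.
KNOWN_INCIDENTS = [
    (re.compile(r"disk latency exceeded on (?P<vm>[\w-]+)"), reboot_vm),
    (re.compile(r"worker heartbeat lost on (?P<vm>[\w-]+)"), restart_service),
]


def mitigate(log_line: str) -> Optional[str]:
    """Run the action for the first matching signature; return None if unknown."""
    for pattern, action in KNOWN_INCIDENTS:
        match = pattern.search(log_line)
        if match:
            return action(match.group("vm"))
    return None


if __name__ == "__main__":
    print(mitigate("2023-01-12T10:00:00Z ERROR disk latency exceeded on vm-042"))
    print(mitigate("2023-01-12T10:00:01Z INFO request completed"))
```

This only works if the log message for a known incident is stable and specific enough to match on, which is one more reason to care about how logs are written.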

Conclusion

I wrote about the importance of logging in distributed systems. In the next blog post, I’ll share the best practices for how to write logs.
