Distributed System Logging Best Practices (2) — How to write logs?

Tsuyoshi Ushio
7 min read · Jan 12, 2023


The previous post described the importance of logging. This article presents best practices for writing log messages.

I observed several repositories and extracted the patterns I frequently see. Any feedback is welcome.

How to design log messages?

Before moving on to the best practices, how should we design a log message? My answer is simple.

Imagine the use case for a log, then provide the information that use case needs.

For example

  • If you had an issue in one part of the system, which data would you want to browse to solve it?
  • If you want to know the behavior of your system without RDP/SSH access, which logs are useful?
  • If you want to auto-mitigate an issue based on logging, which information is required?
  • If someone else reads a log, what action would you want them to take?

Be careful: logs are not for you but for other developers. Don't write logs only for yourself; that injects context that only you know.

Best Practices

Currently, I organize the best practices as follows, and I explain each one in the rest of this article.

Logging Best Practices

Search/Analysis

  • Unique message
  • Trace ID
  • Log Level
  • UTC
  • Full Stack Trace
  • Actionable Logging
  • Typed/Prefixed Logging

Event

  • Event Started
  • Deployment Time
  • External Interfaces

State

  • Aggregation
  • Versions
  • State change events
  • Secret masking
  • Logical Entities

Behavior

  • Logger Mocking

Search/Analysis

Unique message

Log messages should be unique within a microservice. If several places emit the same message, how do you identify which is which? Even if you can spot it by combining other filters, doing so increases the cost of log searches and requires a complex query.
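One cheap way to keep messages unique is to give every log site its own event code. This is a minimal Python sketch; the `ORD-…` codes and the `order-service` logger name are hypothetical:

```python
import logging

logger = logging.getLogger("order-service")

# Hypothetical event codes: one per distinct log site. The code stays
# stable even if the wording of the message changes later, so a search
# for "ORD-1001" always finds exactly one place in the service.
EVENT_ORDER_RECEIVED = "ORD-1001"
EVENT_ORDER_REJECTED = "ORD-1002"

def log_order_received(order_id: str) -> str:
    message = f"[{EVENT_ORDER_RECEIVED}] Order received: order_id={order_id}"
    logger.info(message)
    return message
```

With this convention, filtering by the event code replaces the complex multi-filter query described above.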

Trace ID

If you are working with a distributed system, the Trace ID is important not only for tracing but also for logs. For example, suppose your front end sends a request to component A, which then dispatches it to a worker. When a customer sees a 500 error, how do you identify the root cause? Which request is related? If the front end, component A, and the worker all log the same Trace ID, you can easily correlate and analyze the issue by filtering on it.
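One way to stamp every record with the same Trace ID is a `logging.Filter` attached per request. This is a sketch, not a full tracing setup (a real system would use something like W3C Trace Context propagation); the logger names are hypothetical:

```python
import logging

class TraceIdFilter(logging.Filter):
    """Attach one trace_id to every log record in a request's lifetime."""

    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id  # becomes %(trace_id)s in the format
        return True

def make_logger(trace_id: str) -> logging.Logger:
    # Hypothetical per-request logger for illustration; in practice the
    # trace_id would come from the incoming request headers.
    logger = logging.getLogger(f"frontend.{trace_id}")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(TraceIdFilter(trace_id))
    return logger
```

If the front end, component A, and the worker all install a filter with the same ID, a single filter on `trace_id` correlates the whole request path.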

Log Level

You might want a lot of log messages to analyze an issue or behavior, so you add a bunch of them. However, that is not good practice. A logging system has thresholds, and very verbose logging can eat up resources. Also, setting the wrong log level is very annoying. For example, I use a tool that outputs INFO-level information at the Error level. We know it is a fake error, but a log colored red is still very annoying to see, and someone who doesn't know better will assume it is the root cause of an issue.
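The fix for the "fake error" problem is simply to match the level to the severity. A minimal Python sketch, with a hypothetical gateway handler:

```python
import logging

logger = logging.getLogger("gateway")

def handle_request(ok: bool) -> str:
    # Routine outcomes stay at INFO; only real failures use ERROR, so a
    # red line in the log viewer always means something actually broke.
    if ok:
        logger.info("Request handled successfully")
        return "info"
    logger.error("Request failed: upstream unreachable")
    return "error"
```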

UTC

In a servicing loop, we should talk about time in UTC. You might think your local time zone is easier to understand when you talk with a customer, but it can be very confusing. If your colleague lives in India and is helping you, which "time" would you pick? UTC is the most peaceful solution. Don't confuse people by using local time.
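In Python's standard `logging` module, timestamps render in local time by default; switching the formatter's `converter` to `time.gmtime` makes every record UTC. A small sketch:

```python
import logging
import time
from datetime import datetime, timezone

# logging.Formatter calls time.localtime by default; overriding the
# converter makes %(asctime)s render in UTC for every record.
formatter = logging.Formatter("%(asctime)sZ %(levelname)s %(message)s",
                              datefmt="%Y-%m-%dT%H:%M:%S")
formatter.converter = time.gmtime

def utc_now_iso() -> str:
    """Explicit UTC timestamp for structured log fields."""
    return datetime.now(timezone.utc).isoformat()
```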

Full Stack Trace

When an exception happens, some bad code swallows it, and in those cases it is very hard to identify the root cause. If it is a known issue, you might be tempted to omit the stack trace; but what if the same exception occurs with a different root cause? That is very hard to predict. Once you've got an exception, logging the full stack trace together with a message that shows the context is a promising idea.
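In Python, `logger.exception(...)` inside an `except` block appends the full traceback automatically, so you never have to choose between the message and the stack. A sketch with a hypothetical payment handler:

```python
import logging

logger = logging.getLogger("payment-service")

def charge(customer_id: str, amount: int) -> bool:
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return True  # charge succeeded (stubbed for the example)
    except ValueError:
        # logger.exception logs at ERROR level and appends the full
        # traceback, together with the context that was in flight.
        logger.exception(
            "Charge failed: customer_id=%s amount=%d", customer_id, amount)
        return False
```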

Actionable Logging

The log should not be written for you. Some people write logs that can be understood only by the author; I see such a log and wonder, "what does it mean?" Instead, the best logs are "actionable." If you don't work on the repository and see an exception, you don't know what to do. You might figure it out by browsing the code and spotting the point, but sometimes the code is not on GitHub, so the exception and stack trace are not obvious to a non-author. Good exception messages include the potential root cause and an action plan, which helps enormously in solving the problem. I recently saw the following log in the Service Bus SDK.

Microsoft.Azure.ServiceBus.MessageLockLostException : The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue. For more information, please see https://aka.ms/ServiceBusExceptions

This example not only explains the exception, it also points to a document that describes the potential issue and how to fix it. Consider the log reader as a client: which action plan would you want them to take?

Typed/Prefixed Logging

In a microservice, separate logical functionalities may share classes. Logging usually adds the namespace/class name, but several logical functionalities can live in one class. This might not happen in well-designed code, but legacy code is not always ideal. In that case, you can simply add a prefix. For example, consider a component that bootstraps several services, one of which is the Scale Controller. The bootstrapper's logging mixes all the services together, so you can add a [Scale Controller] prefix to the log lines, or add a service-type column. That helps remove all the other noise and focus on what you want. Typed logging is better than prefixed logging, since querying a dedicated column is faster than matching a string prefix.
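In Python, `logging.LoggerAdapter` is a natural home for such a prefix: every message from one logical component gets the same tag without touching the call sites. A sketch, with hypothetical component names:

```python
import logging

class PrefixAdapter(logging.LoggerAdapter):
    """Prepend a component tag so one filter isolates its log lines."""

    def process(self, msg, kwargs):
        return f"[{self.extra['component']}] {msg}", kwargs

# One shared bootstrapper logger, wrapped per logical component.
base = logging.getLogger("bootstrapper")
scale_log = PrefixAdapter(base, {"component": "Scale Controller"})
```

Calling `scale_log.info("worker added")` then emits "[Scale Controller] worker added". The typed variant would instead put "Scale Controller" in its own structured field, which is what makes its queries faster.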

Event

Event Started

When your system starts an event, it might be a good idea to log it at the Info level. When you diagnose something in production, this information is very helpful. For example, if we see a "server started" event in a log, we know the server's startup time and can check whether any weird behavior happens around start-up.

Deployment Time

Deployment time is one of the best pieces of information in a servicing loop. It has saved my life several times. A typical story goes like this.

We encountered a serious incident. The customer is very upset, and we are under big pressure. A lot of clever engineers gather, but no one can identify the root cause, and time just passes by… Then Finbar, one of the best servicing guys, says, "Hey, we can roll back the runtime. The issue started happening after the deployment."

We had no evidence but rolled it back, and it mitigated the issue like magic. After that, we could find the root cause, because we could narrow the issue down: it had to exist between the two versions.

External Interfaces

We should write logs when we use an external interface: file I/O, REST API calls, and so on. Inside a microservice in a distributed system, things work pretty well because we can write unit tests. The fragile parts are the boundaries of the system, where we can't control the other side. Hexagonal Architecture helps to identify the external interfaces, which can also be targets for mocking.
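A common shape for this is to bracket every boundary call with logs. This is a sketch with a hypothetical blob-fetching adapter; the actual HTTP call is injected as `do_get`, which is also what makes it mockable, as Hexagonal Architecture suggests:

```python
import logging

logger = logging.getLogger("storage-adapter")

def fetch_blob(url: str, do_get) -> bytes:
    # Log before and after the boundary crossing: the external side is
    # exactly the part we cannot unit-test or control.
    logger.info("External call started: GET %s", url)
    try:
        body = do_get(url)
        logger.info("External call succeeded: GET %s bytes=%d", url, len(body))
        return body
    except Exception:
        logger.exception("External call failed: GET %s", url)
        raise  # re-raise so the caller still sees the failure
```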

State

Aggregation

Logging can be expensive if you log everything; that is why we have log levels. However, sometimes you might want to see detailed behavior at the INFO level in production.

In this case, consider aggregating. For example, if you want to see the detailed behavior of a file read for each iteration but that is too verbose, you can aggregate the information and log it once after the whole iteration has finished. That information can help you understand the behavior without going verbose.
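The pattern is simply counters accumulated in the loop and one summary line at the end. A Python sketch of the file-read example (the inputs are stubbed as a list of sizes):

```python
import logging

logger = logging.getLogger("file-reader")

def read_all(sizes: list[int]) -> dict:
    # Accumulate per-iteration detail instead of logging each file;
    # a negative size stands in for a failed read in this stub.
    summary = {"files": 0, "bytes": 0, "errors": 0}
    for size in sizes:
        if size < 0:
            summary["errors"] += 1
            continue
        summary["files"] += 1
        summary["bytes"] += size
    # One INFO line for the whole iteration instead of one per file.
    logger.info("Read finished: files=%(files)d bytes=%(bytes)d "
                "errors=%(errors)d", summary)
    return summary
```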

Versions

Versions are super important. An issue could happen in version 1 while version 2 has a fix for it; in that case, if the customer uses version 1, just ask them to update to version 2. Also, when you investigate an issue by browsing the code, if you don't check out the same version the customer runs, you might misanalyze the root cause because the code you are looking at is different.
When you write logs, each microservice should emit at least one version log for your application (and any other components that have a version) so you can check which version the app uses. You could look up the version through your CI system, but that takes time and involves several steps. If one query shows the version, your analysis narrows down easily.
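A single version line at startup is enough. A sketch; the version string and component names are hypothetical (in practice the service version is usually injected at build time):

```python
import logging

logger = logging.getLogger("startup")

SERVICE_VERSION = "1.4.2"  # hypothetical; typically stamped by the build

def log_versions(component_versions: dict[str, str]) -> str:
    # Emit one line covering the service and every versioned component,
    # so one log query reveals exactly what is deployed.
    parts = [f"service={SERVICE_VERSION}"]
    parts += [f"{name}={ver}" for name, ver in sorted(component_versions.items())]
    line = "Versions: " + " ".join(parts)
    logger.info(line)
    return line
```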

State change event

If your system's state changes, consider logging it at the Info level. For example: a new worker assignment request arrives, a worker is assigned, and so on. These events change the state of the system, and logging them helps you verify that the system behaves as expected.

Secret masking

Be careful not to expose secret information, especially customers' secrets: connection strings, passwords, and so on. If possible, detect secrets in logging by scanning the logs.
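A scanner can be as simple as a set of regexes applied before a line reaches the log sink. The patterns below are illustrative only; real secret scanners match far more credential formats:

```python
import re

# Illustrative patterns for two common secret shapes: a connection-string
# AccountKey and a password=... pair. Not exhaustive by any means.
SECRET_PATTERNS = [
    re.compile(r"(AccountKey=)[^;]+", re.IGNORECASE),
    re.compile(r"(password=)\S+", re.IGNORECASE),
]

def mask_secrets(line: str) -> str:
    """Replace recognized secret values with *** before logging."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(r"\1***", line)
    return line
```

In practice this would run inside a logging filter or formatter so no call site can forget it.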

Logical Entity

We live in a cloud world with multi-layered virtualization. A container might run on a nested VM, and the nested VM runs on a virtual machine; a logical deployment unit might group the virtual machines. In this case, if you are writing a log message for an app that runs in a container, it might be a good idea to log the container/nested VM, the VM, and the logical deployment unit.

Behavior

Logger Mocking

We can use logs in unit and integration testing. When you assert system behavior, especially in a concurrent application, logs can help you assert what happened. Mock the logger, run your system, capture the log output, and validate it.
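In Python this doesn't even need a mocking library: a custom `logging.Handler` that stores records works as the "stolen" log. A sketch with a hypothetical worker function:

```python
import logging

class CapturingHandler(logging.Handler):
    """Collect formatted messages in memory for test assertions."""

    def __init__(self):
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())

def run_worker(logger: logging.Logger) -> None:
    # Hypothetical system under test: its observable behavior is its logs.
    logger.info("worker started")
    logger.info("job processed: id=42")

capture = CapturingHandler()
test_logger = logging.getLogger("worker-under-test")
test_logger.setLevel(logging.INFO)
test_logger.addHandler(capture)
run_worker(test_logger)
```

After the run, `capture.messages` holds the ordered log lines, which a test can assert against; this is particularly useful when concurrency makes return values alone insufficient.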

Conclusion

I listed the best practices I frequently see in production code. If you know other important best practices, please let me know.

The following repository includes samples of the logging message patterns.

LoggingSampleFunction/Function1.cs at master · TsuyoshiUshio/LoggingSampleFunction (github.com)
