Distributed tracing: The definitive guide

Sprkl

Introduction

Modern software architectures give businesses of all sizes new ways to change how they operate, deliver a better customer experience, and gain the agility they need to meet their objectives.

The more complex these software designs become, the harder it is to detect and diagnose faults in the environment. Customers sometimes feel the effect directly, for example when a performance problem leaves many users unable to access the system.

As organizations adopt environments made up of many microservices and backend systems, monitoring and troubleshooting slow down because an error is hard to pin to a single component. The customer experience suffers as a result, which directly affects the organization.

What is tracing?

Tracing is a method that developers use alongside other forms of logging to learn more about an application’s behavior. It keeps track of the actions or events that take place within an application and records details such as when an event occurred and which endpoint was accessed or changed.

Developers must add code to their applications that generates traces of the actions they need to monitor. Consider the following scenario: a user is querying a database, and the database call fails because the request was modified. With tracing enabled, the request is wrapped so that the time it was made, the error it produced, and the response the user received are all recorded.
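As a rough illustration of the scenario above, here is a minimal sketch of wrapping a database call so that its timing, error, and response are recorded. The names `traced`, `TRACE_RECORDS`, `query_database`, and `handle_request` are hypothetical and exist only for this example; a real application would send the records to a tracing backend rather than keep them in a list.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink, used only for this illustration.
TRACE_RECORDS = []


@contextmanager
def traced(operation):
    """Record when the wrapped block started, how long it ran, and any error."""
    record = {"operation": operation, "started_at": time.time(), "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Keep the error type and message, then let the failure propagate.
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        TRACE_RECORDS.append(record)


def query_database(sql):
    """Stand-in for a real database call; rejects anything but SELECT."""
    if not sql.strip().lower().startswith("select"):
        raise ValueError("malformed query")
    return [{"id": 1, "name": "Leo"}]


def handle_request(sql):
    """Run the query inside a traced block and note what the user received."""
    with traced("db.query") as record:
        rows = query_database(sql)
        record["response"] = f"{len(rows)} row(s)"
        return rows
```

After a failed call, the last entry in `TRACE_RECORDS` holds the start time, the duration, and the error text, which is the information the paragraph above describes.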

Nevertheless, when dealing with a distributed software architecture that includes several types of microservices and a large number of components, typical tracing approaches become ineffective in terms of debugging the system.

What is distributed tracing?

While traditional tracing approaches can be implemented by simply including an additional snippet in the code, they pose difficulties in a distributed context when many separate systems, such as microservices, are running independently of one another. 

A single service may run on multiple servers in different environments, and many requests are processed at the same time, so it is nearly impossible to determine the origin of a request using traditional tracing methods.

As a result, distributed tracing is required to track the progress of these complex transactions. It can help with many kinds of performance issues because it follows a request through every microservice or module it touches and provides an end-to-end narrative for each request.

So, developers can now monitor the performance of their applications, identifying which requests are causing errors and which requests or data are causing the app to slow down. It also helps them identify what countermeasures can be implemented to resolve these issues. 

When to use distributed tracing

In distributed tracing, a single trace is made up of a sequence of labeled time intervals called spans. Each span has a start time and an end time and usually carries metadata, such as logs or tags, that indicates what occurred. Because the spans are linked together by a traceId, the business can build a view of the complete life cycle of a request as it is propagated and processed in the system.
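To make the span structure concrete, here is a minimal sketch of a span record and of collecting the spans that share a traceId. The `Span` class and the `spans_for_trace` helper are hypothetical names chosen for this example and are not part of any tracing library.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One labeled time interval within a trace (illustrative only)."""
    trace_id: str                      # shared by every span of one request
    name: str                          # label for the unit of work
    start: float                       # start time in seconds
    end: float                         # end time in seconds
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # None marks the first span of the trace
    tags: dict = field(default_factory=dict)

    @property
    def duration(self):
        return self.end - self.start


def spans_for_trace(spans, trace_id):
    """Return the spans that belong to one trace, ordered by start time."""
    matching = (s for s in spans if s.trace_id == trace_id)
    return sorted(matching, key=lambda s: s.start)
```

Filtering on `trace_id` and sorting by `start` is what lets a tool rebuild the life cycle of one request from spans that were reported by different services.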

Business organizations require this form of distributed tracing to assist them in streamlining applications in a complicated application environment. When we talk about distributed systems or applications, there may be a large number of points of failure throughout the entire application stack to consider. 

In such cases, distributed tracing is used to follow the sequence of events behind a request, identify the root cause of an issue, and help keep the application available while the issue is resolved.

It also assists the organization in understanding how each microservice is performing, allowing the organization to resolve issues more quickly. Additionally, it makes the application more cost-effective for the organization and frees up a significant amount of manpower, allowing employees to work on other tasks and innovate.

How does distributed tracing work?

Distributed tracing is crucial for monitoring, debugging, and improving a distributed software architecture. It follows a single request by collecting and evaluating data on every interaction the request has.

There are a couple of different components of distributed tracing. A request is a mechanism via which applications, microservices, and functions communicate with one another. A trace is information about the performance of requests as they pass through microservices. In a trace, a span represents the operations or segments that are included in the trace. A root span is the first span in a trace and is also known as the starting span. A child span is a subsequent span that can be nested within another span.

For example, when a user accesses an API to retrieve various types of information, the interaction begins when the user submits a request to fetch information. Within a short period, the request moves through various systems to fetch the specified information. Each interaction with those systems is recorded and tagged with the request’s traceId, so the interaction can be tracked more easily.

Upon entering the system, each request is assigned a traceId. As the request moves through the various systems, spans are created: the first is the parent span, and the later interactions are recorded as child spans. All of these spans are linked to the traceId, so the information collected in the parent and child spans can be retrieved by looking up that traceId.
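The parent and child relationship can be sketched with a small in-process tracer. This is a simplified model under stated assumptions: the `Tracer` class and the three functions standing in for an API gateway, a service, and a database call are all hypothetical, and everything runs in one process instead of across a network.

```python
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer: tracks the active span and links children to it."""

    def __init__(self):
        self.finished = []   # spans in the order they completed
        self._stack = []     # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        span = {
            # A new trace ID is created only for the first (parent) span.
            "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
            "name": name,
        }
        self._stack.append(span)
        try:
            yield span
        finally:
            self._stack.pop()
            self.finished.append(span)


tracer = Tracer()


def fetch_from_database():
    with tracer.span("db.query"):
        return {"pet": "Leo"}


def pet_service():
    with tracer.span("pet-service"):
        return fetch_from_database()


def api_gateway():
    with tracer.span("GET /pets"):
        return pet_service()
```

Calling `api_gateway()` leaves three spans in `tracer.finished`. They all carry the same `trace_id`, and each child's `parent_id` points at the span that was open when it started.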

How is distributed tracing used in microservices?

As microservices are extremely lightweight and simple to use, they are being used in a growing number of applications and architectures. They also assist developers by making it easy for them to release updates and test them as rapidly as possible. Since they are run on a distributed backend and because a single request may call numerous services, troubleshooting them can be extremely difficult.

End-to-end distributed tracing is used to visualize the complete route of a request from the frontend to the backend, and each traceId is associated with a specific request. To process the request, a series of subsequent requests is sent to other systems; these are all recorded as spans and associated with the traceId that is generated when the request arrives.

When using end-to-end distributed tracing, data gathering begins as soon as the request is made, such as when a user calls an API to retrieve the information they are looking for. As soon as the user submits the request, a unique traceId and an initial span, referred to as the parent span, are created.

The trace provides the full execution route of the request, and each span in it represents a single unit of work, such as the API completing a query. In this example, the request is then sent to an S3 bucket and other microservices and, finally, to the database to retrieve all of the requested resources.

A top-level child span is formed when the request enters a service, and this child span serves as the parent span for any further requests made from within that service. Each child span is encoded with the original traceId, a unique span ID, the duration of the span, and any relevant error data.

Once the requests have completed, all the spans are visualized with the parent span at the top and the child spans nested below it in order of occurrence. The duration of each span is displayed, so engineers can readily see how much time a request spent in each microservice or data store, and check which requests are taking longer than expected and require troubleshooting.
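The article does not say how the traceId travels between services. One common mechanism is the W3C Trace Context `traceparent` HTTP header, which carries a version, a 32-character trace ID, a 16-character parent span ID, and flags. The sketch below assumes that header format; the helper names `inject`, `extract`, and `start_child_span` are hypothetical.

```python
import re
import secrets

# version - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def inject(headers, trace_id, span_id, sampled=True):
    """Write the calling span's identity into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read trace context from incoming headers; None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def start_child_span(headers, name):
    """Create a span that continues the caller's trace, or starts a new one."""
    context = extract(headers)
    return {
        "name": name,
        "trace_id": context["trace_id"] if context else secrets.token_hex(16),
        "parent_id": context["parent_id"] if context else None,
        "span_id": secrets.token_hex(8),  # unique ID for this span
    }
```

The receiving service keeps the original trace ID, records the caller's span ID as its parent, and generates a fresh span ID of its own, which matches the fields listed above.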
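A text version of that nested view can be sketched in a few lines. The span dictionaries and their field names (`span_id`, `parent_id`, `start_ms`, `end_ms`) are assumptions made for this example; real tracing backends use their own schemas.

```python
def render_trace(spans):
    """Return an indented outline: parent on top, children nested below."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        # Show siblings in the order in which they started.
        siblings = sorted(children.get(parent_id, []), key=lambda s: s["start_ms"])
        for span in siblings:
            duration = span["end_ms"] - span["start_ms"]
            lines.append(f"{'  ' * depth}{span['name']} ({duration} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return "\n".join(lines)


def slowest_child(spans):
    """Return the non-root span with the longest duration."""
    non_root = [s for s in spans if s["parent_id"] is not None]
    return max(non_root, key=lambda s: s["end_ms"] - s["start_ms"])


example = [
    {"span_id": "a", "parent_id": None, "name": "GET /pets", "start_ms": 0, "end_ms": 120},
    {"span_id": "b", "parent_id": "a", "name": "pet-service", "start_ms": 10, "end_ms": 110},
    {"span_id": "c", "parent_id": "b", "name": "db.query", "start_ms": 20, "end_ms": 95},
    {"span_id": "d", "parent_id": "b", "name": "s3.get", "start_ms": 96, "end_ms": 105},
]
```

For the `example` data, `render_trace(example)` lists `GET /pets` first with `pet-service` indented beneath it and the two downstream calls indented one level further, each with its duration.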

How to implement distributed tracing?

Distributed tracing has to cover a large number of microservices, so it helps to build on a common standard. This guide uses the OpenTelemetry framework, which is both vendor-neutral and open source and is being developed under the Cloud Native Computing Foundation (CNCF) as a standard for telemetry such as distributed traces.

OpenTelemetry provides several APIs and agents that can be used to add distributed tracing to applications. Zipkin will be used as the backend for this walkthrough.

For the testing, we will use a sample application called spring-petclinic. Let’s start by installing the application.

git clone https://github.com/spring-projects/spring-petclinic.git

cd spring-petclinic

./mvnw package

With the application built, we also need the OpenTelemetry Java agent. The agent runs in the same process as the application: it attaches to the JVM at startup and instruments the application automatically, so no code changes are required.

So let’s get started by downloading the Java agent.

curl -OL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

Once the agent has been downloaded, we will start Zipkin in a container to act as the tracing backend.

docker run -it --rm -p 9411:9411 openzipkin/zipkin

Once Zipkin is running, we must specify the name of our service in the OpenTelemetry service.name resource attribute and select Zipkin as the trace exporter. This can be accomplished by starting the application with the following command. The -javaagent path must point at the JAR downloaded above. Recent agent releases read the exporter from OTEL_TRACES_EXPORTER; early releases used the name OTEL_EXPORTER instead.

OTEL_RESOURCE_ATTRIBUTES=service.name=TestService OTEL_TRACES_EXPORTER=zipkin java -javaagent:opentelemetry-javaagent.jar -jar target/*.jar

Once this is done, the Java agent runs alongside our application and starts tracing all the requests that are sent to it. Browse the application for a while (Spring Boot serves it on http://localhost:8080 by default) so that some requests are recorded by the agent.

Then open the Zipkin UI at http://localhost:9411/, which lists the traces for the requests sent to the application while you were using it. Click on any trace to see all the spans that belong to that request.
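The same traces can also be read programmatically through Zipkin's HTTP API, which exposes a `/api/v2/traces` endpoint that accepts `serviceName` and `limit` query parameters. The sketch below assumes Zipkin is reachable on the port mapped by the docker command above; the helper names are hypothetical, and Zipkin may store service names in lower case, so the name used in the query may need adjusting.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZIPKIN_URL = "http://localhost:9411"  # assumed from the docker port mapping


def traces_url(service_name, limit=10, base_url=ZIPKIN_URL):
    """Build the query URL for recent traces of one service."""
    query = urlencode({"serviceName": service_name, "limit": limit})
    return f"{base_url}/api/v2/traces?{query}"


def fetch_traces(service_name, limit=10):
    """Fetch traces from a running Zipkin instance (needs network access)."""
    with urlopen(traces_url(service_name, limit)) as response:
        return json.load(response)


def summarize(traces):
    """Reduce each trace (a list of spans) to its size and slowest span."""
    summary = []
    for spans in traces:
        if not spans:
            continue
        slowest = max(spans, key=lambda s: s.get("duration", 0))
        summary.append({
            "trace_id": spans[0]["traceId"],
            "span_count": len(spans),
            "slowest_span": slowest.get("name"),
            # Zipkin reports span durations in microseconds.
            "slowest_us": slowest.get("duration", 0),
        })
    return summary
```

With Zipkin running, `summarize(fetch_traces("testservice"))` would give a compact list of recent traces to scan before opening any of them in the UI.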

Considerations for distributed tracing

When establishing distributed tracing in a microservices architecture, there are a few considerations that you must take into account. First, employ end-to-end instrumentation, which collects traces of all inbound and outbound service calls so that no part of a request’s path is missing when you analyze it later.

Second, keep track of the signals used in site reliability engineering (SRE). The RED metrics (rate, errors, and duration) align with the golden signals of latency, traffic, errors, and saturation, and alerts are set up on them while monitoring the systems. These metrics are important for studying system behavior and troubleshooting because they show how well the system is performing.

Third, record any custom business metrics and other tracing operations for future reference. Finally, always use tools that comply with established standards and adhere to the OpenTelemetry standards.
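As a small illustration of the RED metrics, the sketch below derives rate, error ratio, and a 95th-percentile duration from a batch of completed requests. The record fields (`error`, `duration_ms`) and the nearest-rank percentile method are choices made for this example, not something a particular tool prescribes.

```python
import math


def red_metrics(requests, window_seconds, percentile=95):
    """Compute rate, error ratio, and a duration percentile for one window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "duration_ms": None}

    errors = sum(1 for r in requests if r["error"])
    durations = sorted(r["duration_ms"] for r in requests)

    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of the observations at or below it.
    rank = math.ceil(total * percentile / 100)

    return {
        "rate_per_s": total / window_seconds,    # Rate
        "error_ratio": errors / total,           # Errors
        "duration_ms": durations[rank - 1],      # Duration
    }
```

An alert could then be set on any of the three values, for example when the error ratio for a service stays above a chosen threshold.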

How distributed tracing can help developers

Distributed tracing assists developers in a variety of ways. It follows a request from the time it enters the system until the response has been returned to the user.

Along the way, a large amount of data is recorded and tracked against a traceId. Developers can use this information to lower the mean time to repair (MTTR) by quickly identifying the root cause of any problem that occurred during processing. It also provides information about the customer’s overall experience.

Developers can also assess the overall health of a system more quickly, which improves communication across DevOps teams.
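One simple way to narrow down a root cause from trace data is to look for the failing span that has no failing children, since an error usually propagates upward from where it started. The sketch below applies that heuristic; the field names and the `error_origins` helper are hypothetical, and real incidents may need more than this rule.

```python
def error_origins(spans):
    """Return failing spans whose children all succeeded.

    An error raised deep in the call chain normally marks every
    ancestor span as failed too, so the deepest failing spans are
    the most likely places where the problem began.
    """
    failing = [s for s in spans if s.get("error")]
    # IDs of spans that have at least one failing child.
    parents_of_failures = {s["parent_id"] for s in failing}
    return [s for s in failing if s["span_id"] not in parents_of_failures]


trace = [
    {"span_id": "a", "parent_id": None, "name": "GET /pets", "error": True},
    {"span_id": "b", "parent_id": "a", "name": "pet-service", "error": True},
    {"span_id": "c", "parent_id": "b", "name": "db.query", "error": True},
    {"span_id": "d", "parent_id": "b", "name": "s3.get", "error": False},
]
```

For the `trace` data above, the heuristic points at `db.query`, even though the gateway and the service spans are also marked as failed.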

Distributed tracing vs. logging

While distributed tracing and logging can look alike, since both generate records used for troubleshooting, there is a significant distinction between the two. A log management system uses the logs generated by an application to centrally track data such as errors and the requests being handled.

Logging is used to check that the system is operating effectively and that resources are being accessed properly. Logs can also serve other purposes, such as raising an alert when a particular log entry appears.

When logging a process, each transaction of the process is recorded independently. For example, if many systems are accessed by a single request, each system writes its own log entries separately.

Distributed tracing, by contrast, allows you to track a single transaction from one endpoint to another. Rather than being stored independently, everything that happens from the time a request enters the system until a response is returned to the user is saved against a traceId.
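The difference can be shown with a few records. In the sketch below, entries written independently by different services only become one story once they are grouped by a shared trace ID. The record layout and the `group_by_trace` helper are assumptions made for this illustration.

```python
from collections import defaultdict


def group_by_trace(records):
    """Group independent records by trace ID, ordered by timestamp."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["trace_id"]].append(record)
    for entries in grouped.values():
        entries.sort(key=lambda r: r["ts"])
    return dict(grouped)


# Entries as three services might write them, interleaved across requests.
records = [
    {"ts": 3, "service": "database", "trace_id": "t1", "msg": "query ok"},
    {"ts": 1, "service": "gateway", "trace_id": "t1", "msg": "request received"},
    {"ts": 2, "service": "gateway", "trace_id": "t2", "msg": "request received"},
    {"ts": 2, "service": "pet-service", "trace_id": "t1", "msg": "loading pets"},
]
```

Without the `trace_id` field, the entries for request `t1` could only be correlated by guessing from timestamps; with it, `group_by_trace(records)["t1"]` returns the gateway, service, and database entries in order.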

As distributed tracing generates a large amount of different metadata, it can be used to provide comprehensive visibility into the application’s performance across all of its microservices and containers. It also reduces the time required for error detection, which means identifying the root cause can be accomplished quickly.

Distributed tracing vs. monitoring

Monitoring generally means instrumenting an application and then collecting, aggregating, and analyzing metrics to learn more about the system’s behavior. It is done primarily to notify developers when the system is not functioning properly or is acting abnormally, but it also shows the overall health of the systems used in the microservices architecture.

Generally speaking, when an application is deployed in a distributed systems architecture, monitoring shows several aspects of the application’s performance, including CPU use, disk consumption, and how requests are handled.

Monitoring is usually used in conjunction with distributed tracing, which tracks a transaction from beginning to end as it passes through many different systems or microservices. Everything from the time the transaction takes to the resources it accesses is tracked and stored as metadata. This data is used to analyze the root cause of any problem that monitoring detects.

Distributed tracing vs. profiling

Since distributed environments run a large number of microservices, profiling is used to watch the health of the system in order to deliver a decent customer experience and detect outages. Many different profiles are created, ranging from CPU usage and file I/O to higher-level metrics such as throughput and latency.

During the monitoring process, these profiles give a clear picture of the system, including whether CPU usage is exceeding a defined limit and how I/O is affecting the file system. Profiling therefore provides an overview of system statistics. Distributed tracing, by contrast, emphasizes the temporal component of performance, showing when and where in the code time is spent as a request executes.

Because a request passes through multiple microservices and the trace carries a large amount of metadata, errors or issues in any of the microservices can be detected easily. While profiling can generate alerts, distributed tracing is used to determine what caused the alert in the first place.

Conclusion

Distributed tracing solves the problem of logging and tracing in distributed systems, where traditional tracing does not work well. Since it follows each request from beginning to end, it makes it much easier for engineers to identify and fix bottlenecks as they appear. It also helps identify the root cause as early as possible, so developers spend less time searching for the problem and have more time to devote to other aspects of research and development.
