Distributed tracing

Context

You have applied the [[Microservice architecture]] pattern. Requests often span multiple services. Each service handles a request by performing one or more operations, e.g. querying a database, publishing messages, etc.

Problem

How to understand the behavior of an application and troubleshoot problems?

Forces

  • External monitoring only tells you the overall response time and number of invocations - no insight into the individual operations

  • Any solution should have minimal runtime overhead

  • Log entries for a request are scattered across numerous logs

Solution

Instrument services with code that

  • Assigns each external request a unique external request id

  • Passes the external request id to all services that are involved in handling the request

  • Includes the external request id in all log messages

  • Records information (e.g. start time, end time) about the requests and operations performed when handling an external request in a centralized service
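The first three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the instrumentation of any particular framework: the header name X-External-Request-Id and the names with_request_id and RequestIdAdapter are hypothetical and chosen for this example.

```python
import logging
import uuid

# Hypothetical header name for this sketch; real tracers define their own.
REQUEST_ID_HEADER = "X-External-Request-Id"


def with_request_id(headers):
    """Return a copy of the headers that is guaranteed to carry a request id.

    An id that arrived with the request is reused, so every service involved
    in handling the request sees the same value. If none is present (the
    request has just entered the system), a new unique id is assigned.
    """
    outgoing = dict(headers)
    if not outgoing.get(REQUEST_ID_HEADER):
        outgoing[REQUEST_ID_HEADER] = uuid.uuid4().hex
    return outgoing


class RequestIdAdapter(logging.LoggerAdapter):
    """Prefixes every log message with the external request id."""

    def process(self, msg, kwargs):
        return f"[{self.extra['request_id']}] {msg}", kwargs


# Edge service: no id on the incoming request, so one is assigned.
edge_headers = with_request_id({})

# Downstream service: the id from the incoming headers is reused.
downstream_headers = with_request_id(edge_headers)

log = RequestIdAdapter(
    logging.getLogger("order-service"),
    {"request_id": downstream_headers[REQUEST_ID_HEADER]},
)
log.warning("inventory lookup took longer than expected")
```

Each service passes the returned headers along on its outgoing calls, so every log line written while handling the request carries the same id.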

This instrumentation might be part of the functionality provided by a [[Microservice Chassis]] framework.
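As a rough sketch of how a chassis-style helper might record the start and end time of each operation, the following Python example wraps an operation in a context manager and reports a timing record to a collector. Span, InMemoryCollector, and traced are hypothetical names for this illustration, and the in-memory collector is only a stand-in for the centralized service.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Span:
    """Timing record for one operation performed while handling a request."""
    request_id: str
    service: str
    operation: str
    start: float
    end: float

    @property
    def duration(self):
        return self.end - self.start


class InMemoryCollector:
    """Stand-in for the centralized service that stores timing records."""

    def __init__(self):
        self.spans = []

    def report(self, span):
        self.spans.append(span)


@contextmanager
def traced(collector, request_id, service, operation, clock=time.time):
    """Time the enclosed block and report it, even if the block raises."""
    start = clock()
    try:
        yield
    finally:
        collector.report(Span(request_id, service, operation, start, clock()))


collector = InMemoryCollector()
with traced(collector, "req-1", "order-service", "db-query"):
    time.sleep(0.01)  # placeholder for the real operation
```

Reporting from the finally clause means that failed operations are recorded too, which is often where the timing information is most useful.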

Examples

The Microservices Example application is an example of an application that uses distributed tracing. It is written in Scala and uses Spring Boot and Spring Cloud as the [[Microservice chassis]]. These frameworks provide various capabilities, including Spring Cloud Sleuth, which supports distributed tracing. Sleuth instruments Spring components to gather trace information and can deliver it to a [[Zipkin]] server, which gathers and displays traces.

The following Spring Cloud Sleuth dependencies are configured in build.gradle:

[[RabbitMQ]] is used to deliver traces to [[Zipkin]].

The services are deployed with various [[Spring Cloud Sleuth]]-related environment variables set in the docker-compose.yml:

These properties enable Spring Cloud Sleuth and configure it to sample all requests. They also tell Spring Cloud Sleuth to deliver traces to [[Zipkin]] via [[RabbitMQ]] running on the host called rabbitmq.
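To illustrate what sampling all requests means, here is a minimal Python sketch of a probabilistic sampling decision. It is not Spring Cloud Sleuth's implementation; should_sample and its rate parameter are hypothetical names for this example. A rate of 1.0 keeps every request, which matches the configuration described above, while a lower rate trades trace coverage for less runtime and storage overhead.

```python
import random


def should_sample(rate, rng=random.random):
    """Decide whether to record a trace for a request.

    rate is the fraction of requests to keep, from 0.0 (none) to 1.0 (all).
    rng must return a float in the half-open range [0.0, 1.0).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    return rng() < rate


# Sampling all requests: every decision is True.
decisions = [should_sample(1.0) for _ in range(5)]
```

In practice the decision is typically made once, where the request enters the system, and passed along with the request id so that all services agree on whether to record the trace.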

The Zipkin server is a simple Spring Boot application:

It is deployed using Docker:

Resulting Context

This pattern has the following benefits:

  • It provides useful insight into the behavior of the system including the sources of latency

  • It enables developers to see how an individual request is handled by searching across aggregated logs for its external request id
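The second benefit can be sketched as a simple filter over aggregated log records. The record layout (a dictionary with timestamp, service, request_id, and message keys) and the function name trace_request are assumptions made for this illustration; a real log aggregation system would expose its own query language.

```python
def trace_request(log_records, request_id):
    """Return the log records for one external request, oldest first."""
    matching = [r for r in log_records if r["request_id"] == request_id]
    return sorted(matching, key=lambda r: r["timestamp"])


# Hypothetical aggregated log entries collected from several services.
aggregated = [
    {"timestamp": 3, "service": "inventory", "request_id": "req-1", "message": "stock reserved"},
    {"timestamp": 1, "service": "gateway", "request_id": "req-1", "message": "request received"},
    {"timestamp": 2, "service": "gateway", "request_id": "req-2", "message": "request received"},
]

for record in trace_request(aggregated, "req-1"):
    print(record["timestamp"], record["service"], record["message"])
```

Sorting by timestamp reconstructs the order in which the services handled the request, assuming their clocks are reasonably well synchronized.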

This pattern has the following issues:

  • Aggregating and storing traces can require significant infrastructure

Related patterns

  • [[Log aggregation]] - the external request id is included in each log message

See also

  • Open Zipkin - service for recording and displaying tracing information

  • Open Tracing - standardized API for distributed tracing
