Getting Started with Grafana Tempo

Understanding complex, distributed applications has become increasingly difficult – metrics and logs alone provide an incomplete picture. This observability gap costs teams countless hours debugging performance issues and outages.

Distributed tracing provides granular insight into requests and workflows spanning multiple services, uncovering the root cause of problems. However, legacy tracing solutions like Jaeger and Zipkin can be complex to operate, expensive to scale, and poorly integrated with other observability data.

This is what makes Tempo so valuable – it's an easy-to-use yet powerful distributed tracing backend purpose-built for the scale and cost-sensitivity of cloud-native systems.

As companies adopt microservices patterns, Tempo solves major pain points around understanding complex systems by providing complete, storage-efficient traces. Adoption has grown rapidly on the back of these strengths. But what exactly makes Tempo different? Let's dig in and get hands-on with the basics…

Why Tempo Takes Off Where Legacy Tracing Falls Short

While companies have embraced open standards like OpenTelemetry for instrumentation, legacy backends have struggled to keep up, leading to major gaps:

Cost – Traditional tracing architectures retain 100% of spans in expensive indexed storage and cannot take advantage of cheap object storage. Without downsampling, costs balloon over time.

Complexity – Solutions like Jaeger involve multiple databases and moving parts without opinionated best practices, which increases operational complexity.

Analytics – Without native support for correlating traces with logs and metrics, deriving insights is challenging.

Tempo takes the opposite approach, architected specifically for high scale, cost efficiency, and analytics in modern environments:

  • Optimizes storage for cheap S3/GCS object storage
  • Simplified components and workflows
  • Native Grafana support for correlation across data sources

Recent major updates like a 35x query performance improvement, batched uploads, and Prometheus support further cement Tempo as a leader in open source tracing.

Inside Tempo's Distributed Architecture

Tempo separates concerns into specialized roles while coordinating to provide a complete view of systems:

Distributor – Single ingress point for accepting spans sent to Tempo from instrumented services via OpenTelemetry or other formats:

traces -> Distributor -> Ingester

This shields Ingesters from traffic spikes. The Distributor can scale out easily as volumes increase.
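
Any service instrumented with OpenTelemetry can point its exporter straight at the Distributor. A minimal sketch using the standard OpenTelemetry SDK environment variables – the hostname and port here assume an OTLP gRPC receiver enabled on the default 4317, so adjust them to your configuration:

export OTEL_EXPORTER_OTLP_PROTOCOL=grpc                 # OTLP over gRPC
export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317    # Distributor address (assumed host/port)
export OTEL_SERVICE_NAME=frontend                       # service name attached to exported spans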

Ingester – Buffers incoming spans, aggregates them into blocks, performs lightweight compression, and writes blocks out to storage. Runs as multiple replicas, each holding recent data until its blocks are flushed.

Store – Persists compressed blocks of traces over time, typically in S3 or GCS object storage. Blocks are queried in place and periodically consolidated by the Compactor.

Querier – Serves queries from external clients, retrieving traces by ID from Ingesters or the object store and handling decompression. Horizontally scalable.

Compactor – Combines blocks of traces in the store to provide efficient access. This process allows cost-effective retention of 100% of traces over long periods.

This architecture crucially allows linearly scaling each component independently to handle high volumes of traffic while leveraging cheap storage.

Here's an overview:

[Tempo architecture diagram]
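
All of these roles ship in the single grafana/tempo binary, which is how we'll run it locally in a moment. For larger deployments the same binary can be started once per role. A sketch, assuming the -target flag Tempo uses to select a component in microservices mode (verify the flag against the docs for your version):

# One process per role, all sharing the same config file (flag usage assumed; check your Tempo version)
tempo -config.file=/etc/tempo.yaml -target=distributor
tempo -config.file=/etc/tempo.yaml -target=ingester
tempo -config.file=/etc/tempo.yaml -target=querier
tempo -config.file=/etc/tempo.yaml -target=compactor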

Now let's run through bringing up a local Tempo instance…

Running a Tempo Instance Locally with Docker

To get hands-on with Tempo, we'll deploy it locally using Docker. This allows quick experimentation before a production setup:

Step 1 – Create Network

docker network create docker-tempo

Step 2 – Download Configs

First grab configuration files from GitHub:

curl -o tempo-local.yaml https://raw.githubusercontent.com/grafana/tempo/master/example/docker-compose/etc/tempo-local.yaml

curl -o tempo-query.yaml https://raw.githubusercontent.com/grafana/tempo/master/example/docker-compose/etc/tempo-query.yaml  

Step 3 – Start Tempo

Next we start the Tempo server container using the config:

docker run -d --rm --name tempo \
   -v $(pwd)/tempo-local.yaml:/etc/tempo-local.yaml \
   --network docker-tempo \
   grafana/tempo -config.file=/etc/tempo-local.yaml

Step 4 – Start Query Service

The query service (grafana/tempo-query) pairs the Jaeger UI with a gRPC storage plugin backed by Tempo, letting us look up traces by ID from Ingesters or storage:

docker run -d --rm \
   -p 16686:16686 \
   -v $(pwd)/tempo-query.yaml:/etc/tempo-query.yaml \
   --network docker-tempo \
   grafana/tempo-query \
   -grpc-storage-plugin.configuration-file=/etc/tempo-query.yaml

This exposes port 16686 for the Jaeger UI.
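
Before moving on, it is worth confirming both containers started cleanly. A quick check with standard Docker commands (the tempo container name matches the --name flag above; the query container gets an auto-generated name):

docker ps                          # both the tempo and tempo-query containers should show as Up
docker logs tempo | tail -n 20     # scan the Tempo server log for startup errors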

We now have a basic Tempo instance up – next we'll look at sending sample traces…

Generating Traces from Sample Services

To demonstrate tracing end-to-end, Tempo provides sample instrumented services along with a load generator.

Let's walk through running these locally as a source of traces:

Step 1 – Clone the Repo and Pull Images

git clone https://github.com/grafana/tempo 

cd tempo/example/docker-compose

docker-compose pull

Step 2 – Start Applications

docker-compose up

This brings up:

  • synthetic-load-generator – continuously generates and sends spans imitating a user workflow through services
  • frontend – simple Go web app with OpenTelemetry instrumentation
  • checkout – Python service called by frontend during "checkout" flow
  • product catalog – Java service holding product data
  • payment – Node.js based payment processing
  • delivery – Go-based mock shipping service
  • Prometheus – collects metrics for services
  • Grafana + Tempo – traces pipeline + visualization

With a local Tempo pipeline handling spans emitted from a sample polyglot application, we have an end-to-end setup reflecting a real production environment!
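
To confirm spans are actually flowing, watch the load generator and check which services are up. A quick sketch – the service name below assumes it matches the component list above, so run docker-compose ps first to see the exact names in your checkout:

docker-compose ps                                  # list services and their state
docker-compose logs -f synthetic-load-generator    # emitted trace IDs should scroll past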

Next let's look at querying traces in Grafana.

Analyzing Traces in Grafana with the Tempo Data Source

A key benefit of Tempo is native integration with Grafana for analysis alongside metrics and logs.

Let's walk through setting this up:

Step 1 – Configure Tempo DataSource

Navigate to Grafana and add a Tempo data source via the UI (a provisioning-file equivalent is sketched below):

  • Set the URL to http://tempo-query:16686
  • Set Access to Browser
[Grafana datasource config screenshot]
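
If you prefer configuration as code, Grafana can provision the same data source from a YAML file. A minimal sketch using Grafana's standard data source provisioning format and the endpoint from above – the file path, data source name, and access mode are assumptions to adapt:

# e.g. /etc/grafana/provisioning/datasources/tempo.yaml (default provisioning directory; adjust as needed)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: direct                  # "direct" is the provisioning equivalent of Browser access
    url: http://tempo-query:16686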

Step 2 – Explore Traces

We can now query traces from Explore. For example, to pull up slow requests hitting the frontend service with TraceQL (supported by the Tempo data source in recent Grafana and Tempo versions):

{ resource.service.name = "frontend" && duration > 500ms }

[Explorer query result screenshot]

We could also filter or group by endpoints, latency buckets, attributes, or resource metadata to slice and dice traces any way we want.
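
For instance, span attributes set by OpenTelemetry instrumentation can be used directly in TraceQL filters. A sketch – the attribute name below follows the OpenTelemetry HTTP semantic conventions and depends on how your services are instrumented:

{ resource.service.name = "frontend" && span.http.status_code >= 500 }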

Step 3 – Visualize Services Map

For high-level visibility into the system, a service graph panel shows the communication topology and latency between all services:

[Services map screenshot]

The native integration here allows leveraging Grafana's visualization power with Tempo's traces.

Now that we've covered the basics of running Tempo and working with traces – let's discuss key considerations for moving to production.

Running Tempo In Production

So far we've walked through a local Tempo setup – but real production deployments introduce further challenges:

Scaling Durably

To scale ingestion and queries, Tempo runs horizontally scalable Distributors, Ingesters, and Queriers. These handle high throughput by dividing the load across replicas, while per-component resource limits prevent overload.

For durability, deploying multiple Ingester and Querier nodes protects against outages. Heartbeats detect failures quickly so traffic can fail over.
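
How you scale out depends on how Tempo is deployed. As a sketch, assuming a Kubernetes deployment in microservices mode – the deployment names below are hypothetical and will differ (for example with the tempo-distributed Helm chart), so check kubectl get deployments for yours:

kubectl scale deployment tempo-distributor --replicas=5    # absorb higher ingest volume
kubectl scale deployment tempo-querier --replicas=3        # handle more concurrent queries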

Ensuring Cost Efficiency

The biggest cost driver is trace data storage. By using S3 or GCS object storage, Tempo achieves major cost savings over indexing entire spans. Intelligently compacting blocks increases efficiency further.

Ingesters perform lightweight compression before writing to reduce block size while keeping blocks queryable. Production users report up to 90% storage savings.
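
Pointing Tempo at object storage is a change to the storage block of its config file. A minimal sketch assuming the S3 backend – the bucket name and endpoint are placeholders, and credentials would normally come from the environment or IAM roles rather than the file:

storage:
  trace:
    backend: s3                    # the local example uses "local"; production typically uses s3 or gcs
    s3:
      bucket: my-tempo-traces      # hypothetical bucket name
      endpoint: s3.us-east-1.amazonaws.com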

Correlating Beyond Traces

Tempo integrates with Prometheus for metrics and Loki for logs. This provides a holistic view across tiers by correlating and filtering on context from traces. Native support in Grafana eliminates the overhead of joining external sources.
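
One common correlation is jumping from a log line straight to the trace that produced it. As a sketch of how that can be wired up, assuming logs that embed a traceID field and a Loki data source provisioned in Grafana – the regex and data source UID are assumptions to adapt:

# Loki data source provisioning with a derived field that links log lines to Tempo traces
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "traceID=(\\w+)"   # assumes log lines contain traceID=<id>
          url: "$${__value.raw}"           # $$ escapes the literal $ in provisioning files
          datasourceUid: tempo             # UID of the Tempo data source; assumed here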

Optimizing Query Performance

Query optimization starts with instrumenting services efficiently – avoiding unneeded context propagation and annotation bloat. On the backend, query parallelization, caching, and compression maximize throughput.
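
On the Tempo side, putting a cache in front of the object store avoids refetching the same blocks. A sketch assuming Tempo's memcached cache support – the host and service names are placeholders, and the exact keys should be checked against the docs for the version you run:

storage:
  trace:
    cache: memcached               # enable a cache between queriers and the backend
    memcached:
      host: memcached              # hypothetical memcached hostname
      service: memcached-client
      timeout: 500ms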

Recent Tempo updates like batched uploads and purpose-built indexes provide order of magnitude speed gains for production workloads.

Retaining Institutional Knowledge

Storing complete traces long term, rather than just metrics, captures your org's unique context. Engineers onboarding onto services built years ago can still see how requests actually flow, because the trace data persists unchanged.

Compaction tiers make it practical to retain 100% of traces over multi-year periods at low cost – avoiding the "metrics amnesia" that hits legacy APM tools.
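
Retention itself is a Compactor setting. A minimal sketch assuming Tempo's block_retention option – the value shown (14 days) is just an example, and multi-year retention is simply a larger duration:

compactor:
  compaction:
    block_retention: 336h    # how long blocks are kept before deletion; raise for longer retention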

Key Takeaways

Here are the core things to remember about Tempo:

Easy to operate and learn – compared to alternatives like Jaeger, Tempo has simpler components and workflows and lower resource overhead

Cost-efficient at scale – leverages cheap S3/GCS object storage and compaction techniques rather than only indexed spans

Analytics-optimized – native integration with Prometheus metrics and Loki logs provides correlation

High performance – recent major updates include batched uploads, purpose-built indexes, and 35x query speed gains

Complete visibility – captures 100% of traces over multi-year periods for full historical lookup

Tempo makes end-to-end tracing much more approachable and cost-effective for the scale of cloud-native systems. Whether you have existing tracing or are evaluating options, I highly recommend testing out Tempo.

The Grafana Labs team publishes tons of excellent material on workshops, training, and best practices for production deployments.

Hopefully this gives you a solid starting point for understanding Tempo's unique strengths and how to get off the ground yourself. Happy tracing!