OpenTelemetry

TLDR

How to instrument telemetry

Recap

Notes

Three signals

  • Traces
    • Relational graph visualizing a business transaction
    • Across multiple services and subsystems
    • Chain of spans which include attributes and events
    • Propagation through standardized protocol
  • Logs
    • Unified log data model
    • Support for distributed context propagation
    • Standardized log correlation
  • Metrics
    • Work with existing metrics protocols and standards
    • Migration path for OpenCensus users
    • Ability to correlate metrics to other signals

Multiple signals give us more insight into our system

They make us better spies: we understand more of what is going on inside the system

Which one when?

It depends

Where to start?

  • start at the edges of the system
  • where inter-service communication occurs (e.g. HTTP client, ASP.NET Core, SQL Client, AWS and Azure SDKs, Elastic, Hangfire, Entity Framework Core)
  • work your way inward, step by step

When is it too much?

High-quality instrumentation

  • alignment in the organization and team
  • who will benefit from the telemetry data?
  • which signals are used and how are they used?
  • is there room to introduce new signals? new tools?
  • who will be looking at the telemetry information?

Observability strategy

  • what purpose will the telemetry serve?
    • failure investigation
    • performance insights
    • system behaviour
  • what questions should the telemetry be able to answer?

Observability questions to ask

  • can you easily connect a user to your telemetry?
  • what types of behaviour do you want to group?
  • can you identify the most load generating operations?
  • are you able to debug this from the telemetry alone?
  • can you find suspicious events throughout the system?

Observability guidelines

  • which signals are available to us?
  • which scenarios best fit each signal?
  • do we need more than one signal? Why?
  • when is it enough?

What best practices should be considered?

Handling error scenarios

  • capture any possible failure scenario
  • signal failures in the observability backend
  • include error information
  • consider adding additional information in case of failure
  • instrumentation shouldn’t cause business failures
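
The points above can be sketched in a few lines of plain Python. This is only an illustration, not OpenTelemetry SDK code: `SpanRecord`, `safe_telemetry` and `traced` are hypothetical names. The sketch records the error type and message on the span, marks it as failed, re-raises the business error unchanged, and swallows any failure raised by the telemetry code itself.

```python
import functools
import logging

logger = logging.getLogger("telemetry")


class SpanRecord:
    """Hypothetical stand-in for a span; not an OpenTelemetry SDK type."""

    def __init__(self, name):
        self.name = name
        self.status = "UNSET"
        self.events = []

    def record_exception(self, exc):
        # Include error information so the backend can surface the failure.
        self.events.append({
            "name": "exception",
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
        })
        self.status = "ERROR"


def safe_telemetry(action):
    """Run a telemetry action; log and swallow any failure it raises."""
    try:
        action()
    except Exception:  # instrumentation must never break business code
        logger.warning("telemetry call failed", exc_info=True)


def traced(name, finished_spans):
    """Decorator: record failures on a span, then re-raise them unchanged."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = SpanRecord(name)
            try:
                result = func(*args, **kwargs)
                span.status = "OK"
                return result
            except Exception as exc:
                safe_telemetry(lambda: span.record_exception(exc))
                raise  # the business error still reaches the caller
            finally:
                safe_telemetry(lambda: finished_spans.append(span))
        return wrapper
    return decorator
```

Decorating a function with `@traced("charge", spans)` leaves its return value and exceptions untouched, even if appending to `spans` fails.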

Consider context propagation

  • use baggage, the standard for propagating key-value context across services
  • check for different signals

Use with care

  • check impact on latency due to added data load
  • check what will exist on every hop after it is added
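
To make the cost of baggage concrete, here is a rough sketch that serializes entries into a W3C-style `baggage` header value (comma-separated `key=value` pairs with percent-encoded values) and estimates the bytes added across several hops. The helper names and the example keys (`tenant.id`, `user.tier`) are assumptions for illustration; the parser ignores optional `;property` metadata and is not a complete implementation of the specification.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries):
    """Serialize key/value pairs into a W3C-style `baggage` header value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def decode_baggage(header):
    """Parse a `baggage` header value back into a dict (properties dropped)."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue  # skip empty or malformed members
        key, _, value = member.strip().partition("=")
        # Drop optional ";property" metadata that may follow the value.
        entries[key.strip()] = unquote(value.split(";", 1)[0].strip())
    return entries


def added_bytes(entries, hops):
    """Rough extra header bytes when the baggage rides along `hops` calls."""
    header_line = "baggage: " + encode_baggage(entries)
    return len(header_line.encode("utf-8")) * hops
```

Every entry is sent on every downstream call, so `added_bytes` grows linearly with both the number of entries and the number of hops.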

Sensitive information

  • keep it out
    • personally identifiable information
    • financial or healthcare data
    • passwords, connection or infrastructure information
    • do code reviews to keep all sensitive information out
  • encrypt, tokenize data, mask and redact
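
A minimal sketch of redacting, tokenizing and masking attributes before they leave the process. The attribute names in `REDACT_KEYS` and `TOKENIZE_KEYS`, the fixed salt, and the email pattern are all assumptions for illustration; a real setup would manage the salt as a secret and cover more patterns.

```python
import hashlib
import re

# Hypothetical attribute names; adjust to whatever your services emit.
REDACT_KEYS = {"password", "db.connection_string", "card.number"}
TOKENIZE_KEYS = {"user.id"}
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize(value, salt="demo-salt"):
    """Replace a value with a stable token so it can still be grouped on."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "tok_" + digest[:16]


def scrub(attributes):
    """Return a copy of the attributes with sensitive data removed or masked."""
    cleaned = {}
    for key, value in attributes.items():
        if key in REDACT_KEYS:
            cleaned[key] = "[REDACTED]"  # redact: drop the value entirely
        elif key in TOKENIZE_KEYS:
            cleaned[key] = tokenize(value)  # tokenize: keep it correlatable
        elif isinstance(value, str):
            cleaned[key] = EMAIL_PATTERN.sub("[EMAIL]", value)  # mask in text
        else:
            cleaned[key] = value
    return cleaned
```

Tokenizing instead of redacting keeps the ability to group telemetry by user without storing the raw identifier.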

Processors

  • manipulate data before it is exported to the backend
  • enrich or filter
  • different types of processors per telemetry signal
  • beware of the performance and latency impact
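
The enrich-or-filter idea can be sketched as a chain of small functions that each span passes through before export. Spans are plain dicts here and the function names are hypothetical; this is not the SDK's or the Collector's processor interface.

```python
def drop_health_checks(span):
    """Filter: returning None drops the span before export."""
    if span.get("attributes", {}).get("http.route") == "/health":
        return None
    return span


def add_environment(environment):
    """Enrich: build a processor that stamps each span with the environment."""
    def processor(span):
        attributes = dict(span.get("attributes", {}))
        attributes["deployment.environment"] = environment
        return {**span, "attributes": attributes}
    return processor


def run_processors(spans, processors):
    """Apply each processor in order; keep only spans that survive all of them."""
    exported = []
    for span in spans:
        for process in processors:
            span = process(span)
            if span is None:
                break  # a filter dropped the span, skip remaining processors
        if span is not None:
            exported.append(span)
    return exported
```

Every processor runs for every span, which is where the performance and latency impact comes from: cheap filters placed first reduce the work done by later steps.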

Telemetry correlation

  • each individual signal provides value
  • connecting the signals further strengthens observability
  • reduces mean time to resolve
  • connect logs and traces through TraceID and SpanID
  • consider capturing TraceID in exemplars
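
As a rough illustration of connecting logs and traces, the sketch below stamps every log record with the TraceID and SpanID of the active span. It uses only the standard library: `current_span`, `start_span`, `TraceContextFilter` and `LOG_FORMAT` are hypothetical stand-ins, and real instrumentation would read the IDs from the OpenTelemetry context instead.

```python
import contextvars
import logging
import secrets

# Holds the active span for the current execution context (None if no span).
current_span = contextvars.ContextVar("current_span", default=None)

LOG_FORMAT = "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"


def start_span():
    """Start a span: reuse the parent's TraceID, generate a new SpanID."""
    parent = current_span.get()
    trace_id = parent["trace_id"] if parent else secrets.token_hex(16)
    span = {"trace_id": trace_id, "span_id": secrets.token_hex(8)}
    current_span.set(span)
    return span


class TraceContextFilter(logging.Filter):
    """Attach the active TraceID and SpanID to every log record."""

    def filter(self, record):
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else "-"
        record.span_id = span["span_id"] if span else "-"
        return True  # never drop the record, only enrich it
```

Adding the filter to a handler whose formatter uses `LOG_FORMAT` makes each log line searchable by the same TraceID that appears in the trace view.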

Consider cardinality explosion
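
A quick back-of-the-envelope sketch of why this matters: in the worst case, the number of time series a metric produces is the product of the distinct values of each of its attributes. The attribute names and counts below are made-up numbers for illustration.

```python
from math import prod


def series_count(attribute_cardinalities):
    """Worst-case number of time series for one metric.

    Each distinct combination of attribute values is its own series, so the
    upper bound is the product of the per-attribute cardinalities.
    """
    return prod(attribute_cardinalities.values())


# Hypothetical attributes with a bounded set of values.
bounded = {"http.method": 5, "http.route": 40, "http.status_code": 10}

# The same metric after adding one unbounded attribute such as a user ID.
with_user_id = {**bounded, "user.id": 100_000}
```

Here `series_count(bounded)` is 2,000, while `series_count(with_user_id)` is 200,000,000: a single high-cardinality attribute multiplies everything that was already there.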

Additional best practices

What do you really need?

  • I want to be able to debug any request in the system
    • observe all
  • I want to debug system-wide problems
  • I want to understand the baseline behaviour
  • I want to understand the system architecture

Choose random samples

Head sampling

  • decide at the beginning whether to sample or not
  • root span makes the decision
  • default sampler: ParentBased with an AlwaysOn root; for ratio sampling, use ParentBased with a TraceIdRatioBased root sampler
  • main benefit is unbiased sampling
  • remember: unsampled spans are still created (so context keeps propagating), but they are not recorded or exported
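
A simplified sketch of a ratio-based head-sampling decision wrapped in a parent-based rule. It assumes the decision is derived from the low 64 bits of the TraceID; the function names are hypothetical, and the exact algorithm differs between SDK implementations.

```python
def ratio_sampled(trace_id, ratio):
    """Deterministic decision: compare the low 64 bits of the TraceID to a bound.

    Every service that sees the same TraceID reaches the same answer, and
    because TraceIDs are random, roughly `ratio` of all traces are kept.
    """
    bound = int(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


def parent_based(parent_sampled, trace_id, ratio):
    """Child spans inherit the parent's decision; only a root span decides.

    `parent_sampled` is None for a root span, otherwise the parent's flag.
    """
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)
```

The decision depends only on the TraceID, not on what happens later in the request, which is what makes head sampling unbiased but also blind to errors and slow traces.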

Tail sampling

  • collect all spans until the trace completes
  • defer decision to the end of the trace, based on:
    • overall trace duration
    • status
    • span attributes’ values
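
The tail-sampling flow above can be sketched as a buffer that holds spans per trace and decides once the trace is complete. Several things here are assumptions for illustration: spans are plain dicts, a trace counts as complete when its root span ends (real implementations usually wait for a timeout instead), and the 500 ms threshold is arbitrary.

```python
def keep_trace(spans, slow_ms=500, interesting=None):
    """Decide on a finished trace using duration, status and attribute values."""
    interesting = interesting or {}
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start >= slow_ms:
        return True  # overall trace duration
    if any(span.get("status") == "ERROR" for span in spans):
        return True  # status
    for span in spans:
        attributes = span.get("attributes", {})
        if any(attributes.get(k) == v for k, v in interesting.items()):
            return True  # span attribute values
    return False


class TailSampler:
    """Buffer spans per trace and decide once the root span has ended."""

    def __init__(self, decide=keep_trace):
        self._pending = {}
        self._decide = decide

    def on_span_end(self, span):
        """Return None while buffering, else the spans to export (maybe empty)."""
        self._pending.setdefault(span["trace_id"], []).append(span)
        if span.get("parent_id") is not None:
            return None  # trace not complete yet, keep collecting
        spans = self._pending.pop(span["trace_id"])
        return spans if self._decide(spans) else []
```

Holding every span until the decision is the price of tail sampling: memory use grows with the number of in-flight traces.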

Which sampling strategy?

OpenTelemetry Collector

A mediation layer between your application and your observability backend.

  • vendor-agnostic implementation
  • receive, process and export telemetry data
  • centralizes configuration and management
  • offloads risk of failures & handles error scenarios
  • allows scaling per signal

Agents and gateway deployment model

Which to choose

  • be mindful of how you define scale
  • do not overdo it from the beginning
  • look at what data is coming in
  • adjust for the scale that you actually have, not for the scale that you think you will have at some point
  • add complexity as you need it

Sources

JetBrains .NET Day Online ‘23 - OpenTelemetry - Starts at around 1:05:22 / 11:22:22

talks/how-to-effectively-spy-on-your-systems at main · lailabougria/talks (github.com)

The Best (and Worst) Reasons to Adopt OpenTelemetry - DevOps.com

[Getting Started OpenTelemetry](https://opentelemetry.io/docs/instrumentation/net/getting-started/)

OpenTelemetry with .NET Core: A definitive guide - Do Tran, Medium

https://levelup.gitconnected.com/opentelemetry-with-net-core-a-definitive-guide-9638b7db7afe