Observability for Developers: What You Need to Know?
E21


On today's episode, we're talking about
observability, which is basically how

you figure out why your microservices
are crying at 3:00 AM.

We got to chat with Adriana, who's
a principal developer advocate at

Dynatrace, an OpenTelemetry maintainer,
and apparently enjoys climbing walls

when she's not instrumenting code.

She's also a CNCF ambassador who
somehow finds time to podcast,

blog, and contribute to open source
while maintaining a full-time job.

I'm exhausted just thinking about it.

We dive deep into OpenTelemetry,
discuss why your observability teams

shouldn't be instrumenting your code
(spoiler, developers: that's your job), and

explore the eternal struggle between
getting good observability data

and not bankrupting your company.

And of course, David asked
approximately 47 impossible

questions about instrumentation
strategies and cost optimization.

Classic David.

Hey, at least I didn't ask about Rust.

This time though, we did talk about
the environmental impact of observability,

which was genuinely eye-opening.

We also learned that Sweden has
the greenest data centers in

the world, and that swivel chair
observability is not the goal.

Who knew?

Enjoy the episode.

Thank you so much for joining us
today on Cloud Native Compass.

Adriana, for anyone who is not familiar
with your work or you, could you

please take a moment just to tell us
what you're up to and what you enjoy?

Yeah, sure.

So my name is Adriana Villela.

I am a principal developer
advocate at Dynatrace.

Um, I also have a podcast
called Geeking Out.

Um, I blog a fair bit on medium and
I work in the observability space.

Um, I have for, I guess,
the last couple of jobs.

I'm heavily involved in OpenTelemetry.

I'm one of the maintainers
of the OTel End User SIG.

And, um, by night I like to climb walls.

I, I am into bouldering.

I've sustained a few injuries over the
years as a result, including, I guess my,

my most recent severe injury was an ankle
sprain back at the end of October of 2024.

So, uh, when I was at KubeCon in Salt
Lake City, I was semi limping around

because I think I was like two or
three weeks fresh off the injury.

So I, I definitely practice a

That's amazing.

Very cool.

Uh, a great hobby to be doing.

I know there's quite a active bouldering
and climbing community inside of

the Cloud Native community as well.

I know that Dan Neren is always talking
about trying to get people up, uh,

up hills or rock walls or whatever,
whenever we go to KubeCon, but, uh,

I prefer the, the quiet not breaking
my ankle or spraining my ankle life.

So I'll keep it this way anyway.

I don't blame you.

Yeah, just, I, I, I'm
your stereotypical geek.

I don't, I, I prefer not to
touch grass, but that's, uh, me.

So anyway, let's get back on topic.

There's no point in digressing
right away, off the bat, right.

So you're, uh, you're a wonderful person.

We've met before and I did some research
before we decided to record this as well,

and I ended up on your bio page on GitHub
and the amount of talks and panels and

blogging and podcasting that you do is.

Amazing.

Like there, it's just I can't believe
the output and how much you're doing.

So first of all, thank you for everything
that you're doing for the observability,

OpenTelemetry, and the Cloud Native space.

But I'd love to know just how do you
manage to find the time to juggle

this with a full-time job and be so
present in the Cloud Native community?

I think to be honest, um, I'm
lucky that my job allows me

to build that into what I do.

Um, because otherwise I think I'd need
like clones of myself running around.

Um, because it, it is a lot.

And even even with that, I just, like
my days, my days are definitely full.

But, uh, I think it's really
important, you know, um.

I'm a CNCF ambassador and for me it
was really important when, um, I was

looking for my next role to be able
to build that into what I do build

the, you know, the, the contributions
to OpenTelemetry into the role.

So I was very,

um, thoughtful in, in my
job selection process.

This is my second developer advocacy role.

So before that, I, I, I mean, I was a
Java software developer for like 16 years.

I've led teams off and on, you know,
I've like manager, not manager,

decided I don't wanna be a manager.

I'm much happier as an IC.

Um, but, uh, you know, when I was
looking for my next developer advocacy

role, a a lot of very nice, well-meaning
friends kept just throwing like

whatever DevRel role they, they saw,
and it's like database DevRel and like

networks related DevRel, and it's like.

Cool.

But that's not really my passion.

So I wanted to make sure that I got to
continue doing the same things that I was

super passionate about and interested in.

And I mean, I've spent, uh, you know,
I've been in tech for 24 plus years,

but you know, all the stuff that you've
listed, I've only done in the last

like three-ish years since I've,
um, become a developer advocate.

And it, it feels like this awakening
of my career that I hadn't, you know,

I, I didn't even know tech could be
cool like that until I, I got into

Cloud Native and developer advocacy.

So yeah, I, I definitely credit, um, my
employer for, for giving, for allowing

me to build that stuff into, into my job
'cause it's otherwise, um, that would be a

yeah.

There's this kind of common misconception
about developer relations and

advocacy that, you know, we just go
around conferences on our private

jets and, you know, jump on a stage,
have a drink, and it's like so easy.

But it is a hard life trying
to be that present with so many

people in so many communities.

And, uh, it's, it's, I mean,
it's hard to say this right,

you know, but traveling is hard.

Full stop doing it with intent
to educate and inspire and

help people and be engaging.

It's, it's a tough gig.

So, yeah.

Um, I don't envy that anymore.

I've not been in DevRel for a while.

I so it, it was tough, especially,
I've got two young kids, so traveling

a lot was very hard as well.

Anyway.

Yeah.

Yeah.

Oh yeah.

With that,

so we're about a month now after KubeCon
in London and um, I'm just gonna bring

up a couple of the conversations that
I had there with people because I think

it leads us in nicely to what we want
to talk about, which is observability.

Now every year we go to KubeCon, they
do the keynotes, they ask the question

of how many people are brand new
to Kubernetes and, and Cloud Native

and it's an overwhelming
show of hands, year on year.

This year, no exception with
12 and a half thousand people.

People are new to microservices, Cloud
Native, Kubernetes containers, all of

this, and they get promised this, this
ideology that if we work in microservices,

their jobs and their lives will be easier
because they're slinging less code with

less responsibilities and all of this.

but there's the downside of microservice
and Cloud Native, which is you now have

12,000 applications to keep running.

Arbitrary number I'm throwing out there.

Um, and you need to know when that's
broken and I think the classic example

that I love to talk about with people,
it's like if you have a monolith, a

monolith, you stick a health endpoint on
it, and if it returns a 200, you can kind

of assume that the application is running.

Okay?

When you have

many microservices,

understanding the health of
a whole application is almost

impossible to a certain degree.

Um, but then there's healthy for
what actions and what systems, what

personas, et cetera and this is where I
think people need to start diving into

observability and understanding how do
we bring a little bit of sanity to the

chaos that is a distributed system.

Um, and there's obviously observability.

There's this recent conversation
about observability 2.0.

There's open telemetry, there's
a lot of moving parts in this

space, and you are the expert.

So for the people that are listening
to this and they are on that journey

to Cloud Native, can you help them?

How do they bring some, uh, some sleep?

How do they get more sleep while shipping
microservice Cloud Native applications?

Sorry, that's a really wide and
broad question, but you know, good

Yeah.

And I, I mean.

I think you, you hit it spot on, like, you
know, with, microservices architectures,

now you've got all these moving parts,
like you have so many moving parts

interacting with each other, and it's

sometimes unpredictable.

What might seem okay on the surface is
like some weird-ass crap going on deep

down below that you might not be aware of.

And you know, observability is
the way to help surface that.

Right?

Um, and especially with
distributed traces, I know.

Um, you know, observability itself
has been a bit of a journey, right?

Because I think a lot of people,
they started with like traditional

monitoring, which is like very sort
of a reactive thing, and it's very

like metrics based and, um, and,
you know, logs based and all that.

And that's great.

And, you know, we, we
gotta start somewhere.

But then observability
kind of opened the doors.

And basically said, Hey, you know what,
um, let's look at everything in terms of,

you know, like, uh, distributed traces.

I see distributed traces as the
star of observability because they give

us that end to end view of what's
going on right from start to finish.

You know, I love using the example of
like, you, you're shopping for shoes.

You click on the button that says
Add to cart, and you know, you

have a distributed trace that

captures exactly what happens from
the minute you click on that add to cart

to when, um, the item is added to your
cart and the trace shows you exactly

what is going on in each step of the way.

Right, right down to those database calls.

And then you've got like
your supporting actors.

You've got your metrics and you've got
your logs, your logs telling us, Hey,

these are the things that happened along
the way, along that journey, right?

They're almost like little bookmarks point
in time, um, captures of what's going on.

And then, and then you've got
your metrics, which tell us

things like how long did it
take to get from this service

call to this service call?

How long did we, uh, spend time
waiting on that database to return?

Um, you know.

We can even take it, uh, more
broadly and look at like, how many

shoes did we sell in the month
of November compared to December?

And is it because we saw some like
performance blips because maybe November

was busier than than December, or
December was busier than November.

So having that

overall, um, view of what is going on,
I think helps make life so much easier.

And what I will say with observability
is I think, uh, and I've, I've

written about this recently.

I think sometimes observability is treated
as like this sort of adjacent thing.

It's either like an

adjacent-to-SRE thing, or it's
like, oh, it's an SRE concern.

And it's like, yeah, it will definitely
help your SREs understand your system, but

like everyone has a part to play, right?

I mean, someone's instrumenting your code.

It's gotta be the developers.

You're not gonna ask
your SRE team to do it.

And, or, or like my

pet peeve favorite, the observability
team will instrument your code

and create dashboards for you.

It's like, bro, I don't know
what you're, what you need.

Like, I don't have that context.

So how can you ask an
observability team to do that?

Um, you know, developers need to
instrument their own code, and instrumenting

their code enables them to
troubleshoot their own code, first of all.

Um, so two for the price of one.

Secondly.

When you hand off that code for testing,
your, your testers can go in and say, oh,

I found a bug, because the developers
have instrumented the code.

They can go in and say, Hey.

I found a bug and I know where
it is or they can go, I found a

bug, I don't know where it is.

Developers, you need to go back and
instrument the code because I don't

have visibility into what's going
on and then by the time it gets to

the SREs, hopefully, you know, if
there's an incident or whatever,

they have the information required.

To troubleshoot.

It's not gonna be perfect.

I, you know, observability is an iterative
process, but everyone has a part to play.

Right.

Um, and I think that's the thing that
people need to really keep in mind.

Um, I also take issue with so-called
observability teams, because oftentimes,

um, companies will spin up
observability teams as like the catchall

of like, you will take care of all things
observability, including instrumenting

code and creating dashboards.

And I've, I, you know.

I was running an observability practices
team at Tucows, and I had to push back

a lot because our team kept being asked
to instrument code and create dashboards.

I'm like, this is not what we do.

We, we are the experts on observability.

We'll tell you what practices you need
to follow to instrument your code.

Um, we can give you guidance.

We'll come up with the company-wide
standards because I think you need to have

some sort of standard oversight around it.

But like, this is not a,
we do your work for you.

You have to do your work for you.

Like a developer writes log statements.

So what, what's the difference

between like that and adding
like some traces to your code?

It's, it's just like it, you have to
wrap your, your mind around it, right?

Because everything new, like there is,
there is the tendency for resistance.

There's the, oh my God,
there's the learning curve.

Have to learn yet another thing.

It's overwhelming, but then I look
at it in like, if your, if your

house is on fire, are you going
to keep building the living room?

Yeah.

You would hope so.

I sure hope not.

Take the time to learn this stuff,

right.

Learn.

Learn the observability stuff
so you can save yourself.

Save your house, put out the fire.

It's funny that when you talk about
observability teams, and it takes

me back like, you know, I mean,
you've been in this industry,

like you said, for 20 odd years.

So when DevOps was gaining ground,
every company in Scotland was

hiring DevOps people and DevOps
teams, and you're like, hold on,

You've kinda missed the important message
of what you're supposed to be doing here.

You can't just stick it on a job
title and say, problem solved.

That's not what we're trying to do.

Um, yeah, so there's, like I
said, it's an iterative process.

So with that in mind, and there's so
many avenues we could go down here, right?

But let's kind of start from the
idea that people are trying to

do the migration to Cloud Native.

I think it's very rare that
people start with

a microservice architecture or
a Cloud Native architecture.

They have something that's old
and they want to modernize it.

Um.

What does it mean then
to instrument your code?

You know, when we talk about adding traces
and, and metrics and hopefully logging

already exists for these companies.

I hope so.

Hopefully it's centralized, you
know, this isn't 1994 anymore,

but, what is the starting point?

Um, and I know there's ways to do this
with auto instrumentation and manual

instrumentation, so maybe we can touch on
what is the, the approach today in 2025?

yeah, that's, that's a great point.

And as you pointed out, there's,
there are two ways using OpenTelemetry,

which is, you know, the, the CNCF
standard, um, which has become,

I would say the de facto standard
for instrumenting applications.

It has the backing of, of most of
the major observability vendors.

It's, it's got the second highest number of

Oh wow.

contributions behind Kubernetes.

So, um, and it's a, it, it it's massive.

It's, it's incredible and it's you
know, very sort of thoughtful, uh,

community where, you know, there
isn't this attitude of like one vendor

reigns supreme over all of them.

Um, by design, it's like,
uh, vendor neutrality is the message.

I work with competitors all the time.

I don't see them as competitors,
they're friends, and we're all

working towards a common goal.

So with instrumenting, with Open
Telemetry, as you said, there's.

Two forms of instrumentation, right?

There's the, um, manual
instrumentation and then there's the

auto or zero code instrumentation.

And as you, as the name implies, zero
code instrumentation really means you are

not touching the code to instrument it.

There's usually like some sort
of a wrapper around your code.

Um, so like for Python and or Java,
which I've worked with, there is like a

Python or Java wrapper around your code.

which then will inject the auto
instrumentation, um, for, for

your code and especially like
if you're using libraries that

have been auto instrumented,
like Python Flask, for example.

you don't have to like go in and,
and instrument your requests because

that's already taken care of for you.

So you apply this wrapper and then,
magic, you send your data to your OpenTelemetry

collector, which is a vendor-neutral
agent that is basically,

like, it's basically an ETL tool.

It ingests OTel data, transforms it, and
then spits it out somewhere,

which can be an observability vendor,
and then it magically appears.
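
To make that concrete, here's a minimal sketch of a collector pipeline config. The `debug` exporter is an illustrative choice so the pipeline runs locally; a real setup would swap in a vendor or backend exporter instead. (Note: older collector releases called this exporter `logging`.)

```yaml
# Minimal OpenTelemetry Collector pipeline (sketch):
# receive OTLP data, batch it, and print it locally.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  debug:   # for local experiments; replace with your vendor's exporter

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```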

then you see your traces on there and zero
code instrumentation is a great starting

point, I think because, you know, it, it
is low effort, but you can get into some,

um, a little bit of trouble with zero code
instrumentation in the sense that like you

can end up with more data than you bargained
for, more data than is relevant to you.

Fortunately, with zero code
instrumentation, there is a way to, like,

turn off, for certain libraries,
for example, Hey, I don't want

to instrument everything in this library.

So you turn off the noise and you know and
then beyond that, I think you need to go

in and start manually instrumenting the
code because zero code instrumentation

makes the decision for you, right?

Like it deems like this is
important for instrumentation.

That is what's gonna happen.

Manual instrumentation requires you
to be a lot more thoughtful, right?

Um, because, I think it's gonna be
one of those things where it's kind

of like creating an SLO, right?

It's like you, you take a stab
and you keep refining it, right?

So, you're gonna start.

You know, adding traces in the
same way that you like, add a

log to your, to your code, right?

Um, there you select a chunk of code
that your trace is going to apply to,

and you can add attributes and you can
add what's called span events, which is

like a, a log embedded into your span.

and so that's how you go about
it and it really is a process.

I, I would say as the developers are going
along, um, they can use the process

of debugging their own code to sort of
understand what are the pieces that would

be important for, for instrumenting.

And I would say the
same thing for metrics.

I think for metrics, a lot of the time
things that are really important to us

are things like, how long is the span?

And a span basically represents
an operation or a process.

Um, so you wanna capture that kind
of information, which can be derived

from, a span, by the way and then
you can create metrics for, for other

things that you may deem important.

Um, you know, you've got your typical
ones like your CPU and your RAM

usage and then other things, like I
mentioned like in our shoe example,

like how many, how many shoes are
we, are we processing, per month?

And so it's gonna be, you know,
it's gonna be that kind of process.

You don't have to boil the ocean.

Um, I think when instrumenting
code, especially.

When you're instrumenting existing code,
um, by the very act of instrumenting

existing code, you're basically
introducing technical debt into it.

It's just the way it is, right?

You're, you're adding new code.

Um, you are, it is code.

So you're probably opening yourself
up to bugs, um, as a result, right?

Because you can, you can goof
up when you're instrumenting.

Um, and so you just have to kind of
be mentally prepared for like, what

to expect when you're instrumenting
so that you're not like, oh my God, I

thought this was gonna be so much easier.

And then all of a sudden, like, it's
10 times harder than you expected.

Like, manage your expectations.

I think that's, that's super important.

Um, and then, so there's that
like manage those expectations.

But then also the other thing is
like one sort of like quick win is

instrument your homegrown libraries and
frameworks, because chances are your code

touches a lot of that.

So now you're getting kind of like
you're instrumenting a bunch of stuff,

um, automagically just by doing that.

Um, and then the other one is
any new code that you're writing,

instrument that as you go along in
the same way that you would add log

statements as you're, as you're going
along, just get into that habit.

Like, you know, we, we've gotten,
hopefully many have gotten in the,

the habit of, test driven development.

Think of it

the same way; it's called
observability-driven development:

you're instrumenting as you go along.

So if you forge that habit right
away, um, then at least like the

new code will be taken care of.

I'm not gonna lie, it's,
it's not an easy process.

It's not like a waving a magic wand.

I guess auto instrumentation is more like
that, waving a magic wand, but there it'll

only take you so far and that's why you
do eventually have to go into that manual instrumentation.

Thank you so much.

So like, first of all, let's
clarify a couple of things.

So OpenTelemetry as you said, is now
almost ubiquitous within this space.

I think this is the standard that
everyone should be building on.

Um, it supports Go,
TypeScript, Rust, Python.

Like I think any language that people are
listening to this and thinking I write

in this is probably covered and I would
assume at least pretty good support.

Um, now there's the manual
instrumentation, let's focus on that.

The auto instrumentation is there for
people that want to experiment, but

you know, as you said, the value comes
from you deciding where to create

spans and events and all of this stuff.

And we can get into the cost of
observability as we kinda get into

this conversation, but the challenge is

we're building distributed systems.

We've got network traffic,
we've got other services.

These things we have to correlate as a
request comes in through the front door

and all the way through the system.

And this is where OpenTelemetry and
traces are so important, but we're seeing

now that people can use this information
to do, um, spans within a single service.

So this is function calls and, um,
probably just function calls, I guess but

even in, in a business sense, you talked
about shoes and I don't know how, if

that's a real example that you've seen in,
in the world before, but yeah, why not?

If you have an e-commerce store,
emit, um, event information

on business domains, right?

Because there's analytics within
all this too that could be

propagated up to some other system.

So.

Uh, this is a, a challenge, and I
don't want to ask like an impossible

question, but when is too much, right?

Like, how do people work out what is
important to instrument and what isn't?

Is there a heuristic that you would
say helps people be successful

when saying, you know, taking a new
function and adding a span to it?

Is that always the right approach?

How do I turn that off?

And what is the cost of infinitely
adding these to every single

function within my application across
hundreds of different services?

And again, sorry, that's a really
tough question as well, but.

Yeah, I mean that's a lot to
unpack, but really good points.

Um, I think, you know.

I think the best way to answer the, the
question it's always, it depends, right?

And I think the best way to answer the
question though is to look back at the

definition of observability and I've
been quoting this definition a lot.

It's from Hazel Weakley, and I
think it's a great definition

because it, it encompasses so much.

So observability allows us to ask
meaningful questions, get useful answers,

and act effectively on that information.

If you're finding yourself at a point
where you can't ask the questions, you

can't get the answers from that,
or you can't act effectively on

the information like any of these

any combination of these then you
need to go back to the drawing board

and revisit your instrumentation.

Um, now then you
mentioned the cost aspect.

And that's tricky, right?

Because the temptation is to like
add distributed tracing to everything

spans to everything and that can get
very costly, not just from a dollar

sign standpoint, but from an
environmental standpoint as well, right?

Because

any of these things uses up energy
and, you know, data centers,

um, suck up a lot of energy.

Um, they, they add a fair bit
to the, the carbon footprint and

it's only going up, um, especially
when we consider things like AI.

Um, you know, it's a
double-edged sword, right?

Lots of compute.

So, and it's not just like you,
your application generating,

um, the telemetry, it's also the
ingest of the telemetry, right?

So whether or not you're sending it
to, um, uh, you know, a homegrown

observability tool that you, you know,
like running, uh, like a self-hosted,

uh, observability, um, set up on your,
in your own data centers or whether it's

a SaaS tool, um, and even like as, as
a little sidebar, like I, I did a talk

at, KubeCon on like, you know, examining
like, can we, can we tune the OTel

collector so that it consumes less energy?

And as part of the talk, um, one of
the things that my, my talk partner

Nancy Chauhan and I researched was, we
looked at like, you know, depending

on where your data center is hosted.

It can use up less energy.

Um, and fun fact, Sweden, uh, the
data centers there are like the

greenest data centers in the world.

So, so yeah, like we've got, you
know, we've got our dollar sign costs,

we've got our environmental costs for
the dollar sign costs I think it's

a matter of being very mindful and
effective with your observability data.

Like what are you sending over?

And I think one of the ways,
um, to do that is to limit

what you send through sampling.

Um, another interesting, thing that I've
seen, suggested is to use feature flags.

So for example, if, maybe you wanna
instrument all the things, but maybe

you have feature flags that, that turn
off like instrumentation until, like,

things start going kaka, and maybe you
wanna switch on those feature flags

so you have that like extra bit of
visibility so you can see like all

of the things that are going wrong.

And then once you've like figured
out what the problem is, then you

can shut off those feature flags.

And then, limit what's, what's being,
what's being emitted to, to your

observability backend for analysis.

So those two things can,
can definitely help.

I guess it's easy for us to
overlook the environmental impacts.

I'm glad you brought that up and
mentioned that like it is really

important and, you know, you said
AI, but yeah, that is literally

consuming the environment right now.

So we need to be a bit more
careful with these decisions.

There's also the, the dollar cost
you talked about as well, and

the one you mentioned earlier
is the technical debt as well.

Like, you know, if people are,
companies and teams are building

libraries to do auto, you know,
manual, but auto instrumentation for

the downstream consumers, they're
automatically getting all of these

spans and events that they don't really
know exist unless they dig into it.

So like, yeah, there's
so many considerations.

It's really.

It's a hard problem for people to get

right. And the thing I liked there,
you mentioned the quote from Hazel is

that, you know, if you're not asking
these questions yet, you probably

shouldn't be instrumenting it, although
then you're into a path of resiliency

engineering, I dunno.

When things are broken and you ask a
question, you work backwards and you

instrument it and you move forward,
like, I don't know, there must

be some sort of golden path here.

But, um, that's not something we're
gonna solve on, on, on a a 40 minute

podcast, so we'll move on now.

Uh, this is great.

I think, you know, people have
got a good idea of everything

they're doing now, right?

Traces, spans, events,
OpenTelemetry, they're happy.

They're like, yes, I'm gonna do this.

This is perfect.

It supports my language.

I've got a collector and, uh, I'm,
I'm laughing all the way to sleep now.

But unfortunately, um, there's still
a lot of things to consider now.

There's so many databases where you
can put this telemetry data, right?

I think there's the Tempo
project from Grafana.

There's all the SaaS companies like
Dynatrace and uh, Datadog and New Relic.

There's, uh.

Uh, the Jaeger, I don't even know
what database they use or they have

their own database, but there's
Jaeger for the tracing stuff as well.

When it comes to the technology stack,
beyond the specification of

OpenTelemetry, how do people make the
decision of where do I put this data?

How do I visualize it?

And, you know, is there a right
or wrong answer to any of that?

That's a great question and I think it
boils down to a personal choice, right?

Because I think because OpenTelemetry
is a standard it means that all of

these vendors that support OpenTelemetry
are ingesting the same data.

So now what differentiates one vendor
from another is what do they do with

your data that makes it useful to you?

That allows you to ask those
questions, get those answers, and

act effectively on the information.

And so it becomes a matter of
personal choice at that point.

Right, because is it, there might
be a feature from a particular

vendor where you're like.

Oh my God, this thing is blowing my
mind and I cannot live without it.

In which case, you know, it's kind of a
no-brainer but you also, I guess, have

to balance that with, with cost, right?

Um, because some vendors are
more expensive than others.

Some vendors can be more
expensive than others.

if you don't do your sampling properly
and, and just like instrument all the

things, um, that can add up a fair bit.

So these are, these are the
types of, of things to, uh, to

consider when going with a vendor.

The other thing that you mentioned that
I've seen even in, um, you know, just, uh,

use cases that I've read or, or like, you
know, I review, um, CFPs for, for KubeCon.

Um, I've done it a number of times
and, um, it's interesting to see,

um, the number of proposals that come
in that talk about, and we use this

tool for logs and this tool for, for
metrics and this tool for traces.

And my thought is, you know.

Um, I feel like you're not
getting the most out of your

observability story here in doing so,
um, because you don't have like a single

data store, and not only
a single data store, but a single place

where you can correlate all the data

in one place.

So, um, I think a lot of organizations
end up losing out as a result, either

because they're using a tool that
doesn't, um, that doesn't support

all three main telemetry signals.

i.e. the traces, logs, metrics,
or they've decided, you know,

because some legacy, whatever.

Oh, you know, like we send our,
our logs to Elastic and our metrics

go to Prometheus, and then our
traces go to Jaeger and it's like.

Okay.

Um, and so now you know, I, and I'm
gonna borrow a term from, from my

husband, which describes it perfectly.

You're doing swivel chair observability,
where you're swinging from one system

to another, back and forth, trying to
correlate this stuff, when you should

be doing the put-your-feet-up-on-your-desk
observability, where you should

be able to see everything in the same
interface, everything nicely correlated.

Um, you know, our single
pane of glass observability.

And that's, I, I think, what
we should be aspiring to.

And I think the vendors that provide that
single pane of glass observability, and the

organizations that embrace it, are the ones
that are going to get the most out of their

observability. Because some organizations
are using vendors that support all

three signals and provide that single
pane of glass, but are still, you know,

um, forking different signals
off to, to different backends.

I am going to ask this
question in two ways.

Let's start with the first one.

Um, I... We understand.

I think it is, it's
very obvious now, right?

There is a cost.

You can't just write every single
span and event to a SaaS provider,

because the cost will be, I mean,
it can be pretty, pretty large.

Have you seen in practice people taking
a tiered approach to this where say

maybe they have like a, a Grafana,
what do they call it these days?

Loki, Grafana, LG... T... I don't know. Mimir?

Mi... Mimir.

Yeah, Mimir.

I think it's

LGTM

So, uh, do people do that for like
high frequency, high fidelity data

that lives for an hour where they can
sample it based on success rates and

then push the outliers or anomalies to
SaaS for them to do their magic, right?

Like, nobody's ever said Dynatrace and
Datadog and New Relic don't offer an

exceptional product; it's
just very expensive.

Like is there a way where I can, where I
can say, okay, I'm gonna write everything

here locally on a big, chunky, bare
metal machine, and I'm gonna condense

that down and send everything else,
and then take advantage of Dynatrace

to understand what went wrong with
the things that are really important.

Like, is that something
you've seen in practice?

Does it work?

And should people aspire
to that in some capacity?

I have seen, uh, so in my last job
before I got into, um, DevRel, that's

something that one of my teams was
trying to do, when I was managing the

observability practices team at Tucows.

Um, we were looking at, um,
uh, basically long-term storage of

logs, for example, and not doing it
through the SaaS vendor that we

were, were paying.

Um, and we were looking at, like,
self-storage of logs, using an open

source tool to enable that, for two reasons.

A, um, it gets really expensive and B
we wanted to make sure that we had that

storage for, for compliance reasons.

Right?

and that's, that's another thing
that, um, organizations need to

consider because I, I think that's,
that's why a lot of the times they'll

also want that long term storage.

'Cause I think

some observability companies initially
were like, oh, we'll, we'll only store

your, your data for, for a few days,
because, you know, observability is in

the, in the here and now; if the
problem's happening now, that's

what we wanna be able to troubleshoot.
But, um, what if you wanna go back and

look at history? Or, as I said, like,
for compliance purposes, you need to

retain the data for whatever reason.

Um, that's where it gets very expensive.

And then that's where, you
know, OpenTelemetry comes in, because it

gives you that flexibility via the OTel
Collector, where you can basically, um,

send your data to multiple destinations.

Then you can send your stuff to X SaaS
vendor for your here-and-now analysis,

and then your long-term storage of, like,
say, logs to whatever, like, open source

tool, and maybe something that compresses
your data. Because you're still

paying for storage, it's just
internally, in your own, in your own stack.

Um, that, that's definitely,
that's definitely an option.
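To make that fan-out concrete: the Collector can export the same pipeline to several destinations at once. This is a hedged sketch (receivers and processors elided); the endpoint and file path are placeholders, and the `file` exporter with compression is part of the collector-contrib distribution:

```yaml
exporters:
  otlp:
    endpoint: vendor.example.com:4317  # placeholder: your SaaS vendor
  file:
    path: /var/lib/otel/logs-archive.json
    compression: zstd  # compress the local long-term archive

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, file]  # here-and-now analysis + cheap retention
```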

Yeah, I think that's a good pattern.

I mean, again, I'll, I'll lean
on my own experience here, and

my OpenTelemetry observability
knowledge is, is very poor, right?

So you'll need to fill in some gaps here.

But, um, six years ago when I
worked at Influx, it was just,

uh, working with time series data.

And one of the really cool things
that we could do there is I

could store samples, um, every 10
seconds and keep that for 24 hours.

Then we, we'd enter a downsampling
loop, which is, okay,

that data was useful for 24 hours, and if,
after that, up to three months, we only

want the resolution to be maybe every
five minutes, because that average over

that time is, is valuable enough to us.
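That downsampling loop can be sketched in a few lines of Python. This is an illustrative toy, not InfluxDB's actual mechanism; the function name, window size, and data are made up:

```python
from statistics import mean

def downsample(samples, window_s=300):
    """Collapse (timestamp, value) samples into fixed windows,
    keeping only the per-window average (e.g. 10s data -> 5min)."""
    buckets = {}
    for ts, value in samples:
        # Bucket each sample by the start of its window.
        buckets.setdefault(ts // window_s * window_s, []).append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]

# Ten minutes of 10-second samples...
raw = [(t, float(t)) for t in range(0, 600, 10)]
# ...collapse to two 5-minute averages.
print(downsample(raw))  # [(0, 145.0), (300, 445.0)]
```

A real time series database does this with retention policies and continuous queries rather than an in-memory pass, but the shape of the computation is the same.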

Is that something that's even possible
with traces? Obviously, with metrics,

that can be done, 'cause it's just,
you know, vanilla time series data.

But traces are very different,
and I've heard of a term that

I don't fully understand, so
maybe you can fill me in as well.

But people talk about exemplars
and maybe that ties into this

conversation, I'm not really sure,
but what's your thoughts on that?

Oh yeah.

So exemplars, um, at least in the
context of OpenTelemetry, is, is about

correlating, like, your metric to your trace.

Um, and that one's kind of an
interesting one, because, it's been

a while since I've checked it out,
but I think it's only been fleshed

out for Java, and experimentally
for .NET, and nowhere else.

Wow.

Okay.

Maybe a bit too soon then to
be kicking the tires on that.

Yeah.

Yeah.

That's still, that's
still a work in progress.

That's still a work in progress.

But yeah, that's kind
of an interesting one.

But yeah, that, that's my, that's
my understanding of, of exemplars

from what I've seen in OTel. But
yeah, I mean, in terms of, of, like,

tweaking the granularity of traces,
I think it's, it's really, I guess,

a matter of, like, do you store all
the traces or some of the traces?

Um, so then I think it becomes kind of...
What about then with the dimensionality
or cardinality of these traces?

Like, you know, I think Charity
Majors has been vocal about how we

should have lots of, um, different
properties within a trace, revealing

as much information as we can.

But that feels to me something that
gets very costly very quickly as well.

And the value of that data three months
from now might not be as high as it was

three seconds after the actual trace.

Like, um, yeah, uh, I'm
assuming that... yeah.

Yeah, that one's an interesting one
because I like, I, I think she's right

in the sense that it's, it's good to have
as much information about our traces as

possible, but then again, you run into
the... it's a data storage problem,

and, and I think, like, different
observability vendors will also

charge you on the data differently.

Um, so maybe like having high
dimensionality might be cost effective

if you're working with one vendor
and then you switch over to another

vendor and all of a sudden it's
like, boop, your costs have blown up.

So that's another thing, um,
to consider as well, like it's.

You know, you're, you're still
capturing the same data, but

do you need to tweak your data?

Because now it's, it's not
so pocketbook friendly.

I'm sure that the people listening... I
don't know if we're, we're helping people

or scaring the
absolute crap out of them.

I'm just confusing them.

It's like, it depends.

I'm sorry.

All right.

Well, let's chit chat a little bit.

I swear observability is worth it.

let's, let's change tack
a little bit, right?

I mean, I think you've done a great
job of explaining observability,

OpenTelemetry, and, you know, all the
bits and the nuts and bolts, the

lexicon that people need to understand.

Um, but you have been doing
developer advocacy now for a

while, and I'm curious, you know.

What are some of the lessons
learned that you've seen as you've

personally adopted OpenTelemetry?

Or as you've seen or spoken to other
people doing it, right?

Like how can people be successful?

Do you have any tips like, or even
just general advice of how to get

started, beyond npm install or go get,
or whatever it is they're using?

I think, first of all, for
learning OpenTelemetry,

the best way to learn is based
on how you learn best, right?

Um, we actually have a, a series
on, uh, as part of the OTel End User

SIG, called OTel Me, and we have, um,
different OTel practitioners come
different Otel practitioners come

on and talk about their experiences
with, with using OpenTelemetry.

And we had a, a guest, um, last week, uh,

and he got into OpenTelemetry through
Outreachy and he was talking about,

um, his journey into OpenTelemetry and
he said for him, like he's a visual

learner, so he really craved having,
like those videos explaining how things

worked for him and then the videos kind
of, uh, gave him enough of an overview

where he's like, okay, these are the
topics that I wanna dig deeper into.

And then he dove into the docs, into the
OTel docs, um, for more information.

Obviously,

I think, in, in an ideal world, I would
love it for everyone to go to the

OTel docs, um, as their, you know,
one-stop shop for all things OTel.

but, you know, docs aren't perfect.

I think the folks running the Otel
docs team are fantastic and are

doing a really great job in, in, you
know, constantly improving the docs.

but I think in, in some cases, some
people, don't find the docs useful

enough as a, as a starting point.

So then they'll, you know... I, I,
honestly, Google is your best friend.

And, and I will, you know, there's
like so many people writing about

OpenTelemetry, um, from various walks of
life, whether it's, you know, blogs from

observability vendors or personal blogs.

Um, like I myself documented
my own OpenTelemetry journey.

I was like learning in public as
I was, you know, I was managing

this observability practices team
and I'm like, oh, damn it, I don't

know anything about observability.

I'm gonna learn and.

I'm gonna blog about it as I go.

Um, so I guess DevRel was, like,
perfect for me in that sense.

Um, I, I think, like I said,
Google is your, is your best friend

'cause there's like a wealth of
resources from videos to blog posts.

I think having these good overview
videos or blog posts, to sort

of give you an idea of what those base
concepts are, um, for OpenTelemetry,

are great, because then you can use that
to, like, as he did in his journey,

dig deep, um, you know, go into the docs
and dig deep on a, on a particular topic.

So it sort of helps to
direct your learning journey.

I'm also gonna do a shameless plug here.

I do have an O'Reilly video course
on observability with OpenTelemetry.

So if you have a, an
O'Reilly subscription,

You can check it out if you're,
if you're a visual learner.

So, um, yeah, I mean, tons of resources.

There's, you know, also I think
my, my former colleagues, Ted Young

and, and, um, Austin Parker have a
great book on OpenTelemetry as well.

Um, there, I mean, the sky's the limit.

It's, it's a matter of, like... I think,
I think, having

a good getting-started resource of, like,

these are the main concepts,
and then these are the things

that I need to dive into.

Awesome.

And it's not a shameless plug if
it's super valuable to people.

So I'll make sure all these links
are in the description for people

to click on and make it easier.

Um, lastly, obviously you
are a CNCF ambassador,

you are an OTel contributor,
you're in this space.

What if someone is listening to
this and going, you know what?

I want to help.

I wanna join this, this mission to
make OpenTelemetry easier for people.

How can they get involved and how
can they contribute to the project?

Ooh, amazing.

Yes, great question.

Um, so I'll, I'll send you afterwards.

I have a blog post that I wrote
on how to contribute to open

telemetry from my viewpoint as,
as someone who was in those shoes.

But, um, in a nutshell, basically, you
know, first of all, join CNCF Slack.

Um, in CNCF Slack, there are
gajillions of OTel channels.

Like, they all start with otel-.

Um, pick an area of open
telemetry that interests you.

So, for example, you wanna learn more
about the OTel SDK, or maybe, like, you

are a Pythonista and you would love to
contribute more to OpenTelemetry Python.

Join those channels that interest you and
just, you know, monitor the conversations.

Just, just be an observer.

Fly on the wall.

Join, join the SIG meetings.

You don't have to be actively involved.

You can just sit and observe, um,
if you're looking for more active

involvement, but are afraid of, you know.

touching code at this point.

Not, not because you can't code,
but because contributing to an open

source project can be overwhelming.

A great place to get started
is always in the docs.

Um, because, as I've said, you know, I,
I would love for the OTel docs to be,

like, you know, the book of record for
all things OTel, and the only way for that to happen

is to have people, um, who have used
OpenTelemetry and have found a gap in the

docs, and, and make, uh, an
effort to contribute back to the docs.

Like, for example, I was doing some
research for, um, for my KubeCon talk

on, on, like, the greener OTel
Collector, and, and I was using this

tool called the OpenTelemetry Collector
Builder, and I went to the OTel docs for

some guidance on it, and I got stuck.

And again, Google was my friend.

I reached, I phoned a friend, got
some help, and then I'm like, so, and

then I wrote a blog post about like,
what I did, but then I'm like, well.

I wanna be a good open source citizen.

So I actually made a point of contributing
back to the docs with the stuff that

I learned so that people wouldn't
be stuck like I was as well when

they, when they go back to the docs.

And I'm happy to say the PR was merged
last week, so that always makes me happy.

So it's such a great way, but
you know, bottom line, it's.

Such a great way to contribute
to open telemetry in that way.

Um, joining the OTel End User SIG
is also a great way, even if you're

not necessarily an end user per se.

Um, it's a great way to contribute
because we have tons of things

that we run as part of the sig.

Like I mentioned, we have this OTel
Me series, so we're always looking for

people to, uh, to interview.

We have OTel in Practice,
where, um, you know,

if you have an interesting OTel
topic that you wanna talk about,

um, and wanna present on, or you
maybe wanna test out a talk.

You know, you, you wrote a talk, you wanna
flesh it out, use this as a guinea pig.

Um, so you can, you can join that.

Um, we're, we work with the SIGs to run
surveys, so we liaise with the SIGs.

Um, and, and so we've got a couple of
people who joined recently who have

taken on that mantle of like really,
um, streamlining our survey process.

There's always stuff, to be done.

And so, um, if you're looking
for a way to contribute, that's,

that's a great way to get started.

Fantastic advice.

Alright, I think that is now us at time.

Um, do you have any last words for
the audience before we say goodbye?

I would say, you know, don't be shy
about contributing to open source,

especially OpenTelemetry. When
you're submitting your first PR,

don't be...

don't, don't feel overwhelmed, because
everyone has been nothing but nice,

um, to me and, and to others that
I've talked to since, um, since

contributing to OpenTelemetry.

The, the comments are always thoughtful.

No one is ever aggressive or rude,
so it makes me wanna contribute more.

So don't be afraid because this is
honestly like a wonderful community.

And if there's, there's one CNCF
community that I recommend that you join,

and of course I am very biased,

I, I would say definitely join OTel.

We are friendly.

Well, it's been an absolute pleasure.

Thank you so much for your time.

Thanks for having me.

Thanks for joining us.

If you want to keep up with us,
consider subscribing to the podcast

on your favorite podcasting app,
or even go to cloudnativecompass.fm.

And if you want us to talk with someone
specific or cover a specific topic, reach

out to us on any social media platform

and till next time, while exploring
the cloud native landscape, on three,

on three.

1, 2, 3. Don't forget your compass.

Don't forget

your compass.


Creators and Guests

David Flanagan
Host
David Flanagan
I teach people advanced Kubernetes & Cloud Native patterns and practices. I am the founder of the Rawkode Academy and KubeHuddle, and I co-organise Kubernetes London.
Laura Santamaria
Host
Laura Santamaria
🌻💙💛Developer Advocate 🥑 I ❤️ DevOps. Recovering Earth/Atmo Sci educator, cloud aficionado. Curator #AMinuteOnTheMic; cohost The Hallway Track, ex-PulumiTV.
Adriana Villela
Guest
Adriana Villela
Observability, Platform Eng | Principal DevRel & Sr. Tech Leader | CNCF Ambassador | International Keynote Speaker