InfluxDB 3 & Rust
Welcome to Cloud Native Campus,
a podcast to help you navigate
the vast landscape of the cloud
native ecosystem. We're your hosts.
I'm David Flanagan,
a technology magpie that can't stop
playing with new shiny things.
I'm Laura Santamaria,
a forever learner who is
constantly breaking production.
Do you want a single database to
store high precision,
multidimensional time series data
that supports infinite cardinality?
Well, we're not there yet,
but Paul Dix does share his vision
and roadmap for InfluxDB three.
Do you want to, David, not have to bring up Rust just to get a guest to actually talk about Rust? Now's your chance.
In all seriousness, we not only get to talk about a move from Go to Rust,
but also about observability and
how it's changed over time,
as well as a little bit about
open source licensing changes.
So does less Rust mean Go? No, but really, let's talk Rust.
All right.
Thank you for joining us, Paul.
And for anyone who's not aware,
can you please tell us a little bit
more about you and what you're up to?
Yeah, so I'm Paul Dix.
I'm the co-founder and current
CTO of InfluxData.
We are the company that makes InfluxDB, which is an open source time series database.
It's useful in use cases like
tracking system metrics and
application performance metrics.
Sensor data, that kind of stuff.
My background is as a programmer.
I started the company in 2012 and
we went through Y Combinator,
actually under a different idea.
And then we pivoted to this idea in
the fall of 2013. Awesome. Thank you.
There's more I can go on like at
length, but I'm trying to trying
to be concise. Yeah. All right.
Well, let's talk about your work on InfluxDB. So, over the last, I'm not sure how many, years, you started work on a rewrite of InfluxDB, moving towards using Rust, on a project called IOx, which I believe this year you have now pushed out into production as part of your InfluxDB Cloud offering. So I really wanted to drill into that and understand more: why a rewrite, the change in programming language, re-architecting the system using Apache Arrow? It's a huge task for a company to take on, and I'd love to know more. Yeah.
So I'll just start with the obvious,
which is a rewrite is basically like
the worst possible thing you can do.
Like unless you're just like a
sadist or a glutton for pain.
Um, you should avoid a rewrite at all
costs. Okay, that's out of the way.
Um, so InfluxDB, the original version, is written in Go, and the core architecture of the database is built around what I would call, like, a clever hack. Right.
So it's essentially, it's kind of like two databases in one.
One is basically a time series
database that organizes data on
disk as individual time series,
which are value timestamp pairs
in time ascending order. Right.
So as all the data comes in, it tries
to organize that data on disk in
that format, which is basically
like a very like indexed format.
The other piece of this is an
inverted index that maps metadata
to individual time series.
So the metadata is like a
measurement name,
a tag key value pair, a field name.
So just like, you know,
usually people are familiar with
inverted indexes from document
search where you have a document,
you have an inverted index of the
words that appear in the document.
And then when you search, you find
the words and you find the documents
that have those words, right.
In this case we say, oh tag key value
pair like host equals server A or
region equals us-west, and then you
find the time series that match
that piece of metadata. Right.
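A minimal Rust sketch of the two structures being described here, a per-series list of timestamp/value pairs plus an inverted index from tag key=value pairs to series IDs. The type names and fields are purely illustrative, not InfluxDB's actual code:

```rust
use std::collections::{BTreeMap, HashMap};

// Illustrative types only, not InfluxDB's implementation.
type SeriesId = u64;

// One time series: timestamp/value pairs kept in time-ascending order.
#[derive(Default)]
struct Series {
    points: BTreeMap<i64, f64>, // nanosecond timestamp -> value
}

// Inverted index: tag key=value (e.g. "host=serverA") -> series that carry it.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, Vec<SeriesId>>,
}

impl InvertedIndex {
    fn add(&mut self, tag: &str, series: SeriesId) {
        self.postings.entry(tag.to_string()).or_default().push(series);
    }

    // "host=serverA" -> all matching series; the query engine then reads only
    // the time range it needs from each of those series.
    fn lookup(&self, tag: &str) -> &[SeriesId] {
        self.postings.get(tag).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut index = InvertedIndex::default();
    let mut series: HashMap<SeriesId, Series> = HashMap::new();

    // Ingest one point for series 1, tagged host=serverA.
    index.add("host=serverA", 1);
    series.entry(1).or_default().points.insert(1_700_000_000_000_000_000, 0.42);

    // Needle-in-a-haystack query: find series for a tag, then scan their points.
    for id in index.lookup("host=serverA") {
        if let Some(s) = series.get(id) {
            println!("series {id}: {} points", s.points.len());
        }
    }
}
```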
So what that means is as you're
ingesting all the data, there's a lot
of indexing work that happens. Right?
So it chews through like a lot
of CPUs and stuff like that.
And it reorganizes the data.
Now when a query comes in, if the
query is for one individual series,
generally those queries are very,
very fast, right?
This is a system that's optimized
for that kind of query workload and
for that kind of like needle in a
haystack query specifically on,
you know,
the the individual series and the
time range that you're looking for.
But again, like the problem becomes
as your index, the set of metadata
that you're tracking grows,
commonly referred to as like the
cardinality problem. Right.
The number of unique tag values that occur, the actual number of individual series that occur.
You spend more and more time
indexing the data,
and it just becomes more and more
expensive to ingest it. Right?
You spend more CPUs, you spend
more disk space storing the index.
And like all this other stuff,
it becomes really painful,
particularly as you try to add
like more and more granularity
and visibility, like more
precision into the measurements
that you're taking. Right?
Like, again,
more kind of metadata that you're
capturing around those things.
And then the other side of this
is when you want to do a query.
If you have a query that's going
to touch, you know,
tens of thousands, hundreds of
thousands or even millions of
individual time series, right?
You want to do an aggregation across
all these time series in a region or
whatever, and compute something.
Those queries become
prohibitively expensive.
And oftentimes, like,
you can't even do them because the
engine will just fall over, right?
Just the way you'd have to do it, mapping it onto that index structure, is way, way too expensive.
So basically that's the version of the database written in Go. Version one, version two, same storage engine; we wrote that storage engine from scratch, and it kind of has this architecture. We prototyped the initial version of that storage engine in the fall of 2015, and then we had the first release in early 2016 that had it, and we've iterated on it and added to it over time.
And essentially what we found over
the next whatever number of years,
four years, is that people wanted
to have higher cardinality data.
They wanted to feed in data where
they actually didn't have to
worry about the cardinality or
the ephemerality of the values
that they were feeding in,
and they wanted to do more
analytical queries across it. Right.
And all of this stuff was built around our query language, InfluxQL, which is a query language that looks kind of like SQL. So it's kind of familiar and friendly, but for people who really know SQL, it can be frustrating in unique ways, because it doesn't actually work exactly like SQL. But for some things, it's super easy to create, like, a time series query or whatever.
So again,
like there are multiple problems
we were trying to solve for how do
we solve this cardinality problem?
How do we give people a query
engine that can be useful for
analytical style queries on
larger chunks of data?
And then this other piece, which was: we needed to figure out how to store a massive amount of data in a much more cost-effective way. Right?
InfluxDB one and two have just a
base level of assumption that
you have a, you know,
a locally attached SSD or, you know,
whatever, like an EBS volume,
high performance network volume
with provisioned IOPs and whatever,
and all of your data is stored on
that. And it's super expensive.
And for a lot of our use cases,
right, people could have a
year's worth of data, but 99% of
their queries hit basically the trailing few hours or few days worth of data. Right. But they want this stuff available and accessible, but they don't need it available and accessible, you know,
at the same response times.
And they definitely don't need it,
like stored on expensive NVMe
drives or whatever. Right?
So we needed to figure out a way
to decouple the compute of
ingestion and query from the
actual storage of the data. Right.
And obviously, like all this stuff
is like building up over the
course of like, you know, 2017,
2018 is when I'm noticing this 2019,
it's just becoming more apparent.
And the thing is, also during
this time, there are interesting
things happening out there in
the infrastructure world, right?
The rise of Kubernetes,
for example, wasn't there when we first created InfluxDB.
So this idea of like containerized
applications and like this ephemeral
application stack or ephemeral
compute stack that you layer on.
And then the other thing, the rise
of object storage as basically
like a common storage layer.
Those things happened over the course of that decade, you know, the 2010s.
And I think one company that really took advantage of that, at least the rise of object storage and decoupled compute from storage, was Snowflake, right? They're the first company that I think really commercialized this idea of:
We can create this big data system
that stores data super cheaply
and just layers on compute on
demand to execute queries on it.
Now, Snowflake is obviously designed for a completely different use case than InfluxDB, right? Snowflake is a data warehouse, data at scale, whatever. InfluxDB is about operational data and real time, right? You need to be able to query it within milliseconds of writing it into the database, and you need those queries to generally return sub-second, so that you can build monitoring and alerting systems on it.
You can build real time dashboarding.
So in 2019... actually, the fall of 2018 is when I started picking up Rust. And I thought, again, the first commit on InfluxDB was in 2013, but the basis technology that we built for it we had actually started in the fall of 2012, and Rust was not in a place where I would use it then. I used Go because I thought we would be able to move faster creating the database, so we used Go as the language.
But in 2018 I started picking up Rust and thinking, okay, this is actually interesting. And then the fall of 2019 is when the async/await stuff landed in Rust. And when that landed, I thought, okay, this is probably going to be a serious language for building server side software, where you have to handle network requests and all this other kind of stuff. Right.
That was, I think, for me, the final piece that I was looking for: that Rust had actually arrived at a point where you could use it to build a complicated piece of server software, and you wouldn't have to build everything yourself. Right? There are certainly successful projects in Rust that started before that, right? Like Linkerd did that. But at that point I was like, okay, that's interesting.
So coming into the beginning of 2020,
which is when I kicked off this
project, you know, I just said,
okay, we need a different
database architecture.
This combination of this inverted index plus this time series storage, and the way the entire database engine works, is not going to work for what we want to build, for the requirements we want to meet. Right.
Like we're using like memory
mapped files, which is like again,
you're not going to get that in
object storage.
And it's not great to use in
containers and like all this
other stuff.
So I was like, okay, if we're going to re-architect the entire database, that's basically a rewrite of the database. We could do it in Go. And, you know, there's a bunch of stuff that exists in Go that we would reuse, that obviously wouldn't be rewritten, the language parser and all this other stuff. But basically, I was looking at the project and I thought, this is a rewrite, if we actually try to make these big changes.
And again, in like late 2019,
early 2020, one of the other
things I noticed out there in
the world was Apache Arrow.
Like I'd known about the project
for a little while.
Apache Arrow is like an in-memory columnar format specification.
Um, and I was looking at that and
I was looking at Apache Parquet,
which is a file format for this
kind of structured analytical data.
And I thought, well,
there's I think there's really
something interesting here.
So again, like I wrote like some blog
post in like early 2020 where I said,
like, I thought that the different
pieces of the Apache Arrow project
would become like a way for
companies that are building data
warehousing systems and big data
systems and streaming data systems,
basically like all these like
analytical systems that are working
on observational data of any kind,
right?
Whether it's server monitoring
data or sensor data or whatever,
those standards would become a way
for people to collaborate and build.
You know, common infrastructure
but also proprietary solutions.
And those will be like the touch
points in terms of like how you
exchange data between these systems.
So that was kind of the thesis
in early 2020.
Rust as a language is going to be better because, one, the multithreading support in Rust, the way it handles it, I think is just way, way better, right? Because it's kind of enforced by the compiler. Um, we wanted strict control over how memory is managed and that kind of stuff, which basically means we didn't want a garbage-collected language. Performance was going to be a critical thing. Like, Go is super fast.
Don't get me wrong,
I love it as a language.
It's way easier to learn and work with than Rust, I think, my personal opinion. But I just thought for this kind of software, for a database system that has to perform at scale with high performance, Rust just seems like the logical choice. And if it wasn't Rust, it would be C++. But I think in this day and age it's just a better choice to use Rust. Yeah.
And then over the course of the
next three and a half years,
like initially it was basically me and one other guy within Influx for a couple of months. And then we hired somebody else, Andrew, who's still with us. And basically the three of us treated it almost as a research project for the first six months. Within the company, there was no way I was going to get people to buy off on: oh, Paul wants to rewrite the database, yeah, let's put a bunch of effort into that.
Um, no, it was basically like,
I'm going to do this as a
research project because I think
it's interesting and I'm going
to see where it leads.
And then by November of 2020, uh, we'd had enough of the pieces figured out, right? We were going to use Apache Arrow. We were going to use Parquet as the persistence format. Object storage is where all the data was going to live. Uh, Apache Arrow Flight was going to be the RPC mechanism, which has since evolved into Flight SQL, a new standard they have for essentially doing RPC and SQL queries in these kinds of data systems. And a project called Apache DataFusion, which right now is a subproject of Apache Arrow. DataFusion is a SQL parser, planner, optimizer, and execution engine written in Rust.
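A small, hedged sketch of what that stack looks like from Rust: DataFusion planning and executing plain SQL over Parquet data. The table name, file path, and column names here are made up for illustration, and exact APIs vary between DataFusion releases:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // DataFusion is the SQL parser/planner/optimizer/execution engine;
    // Parquet is the persistence format the data lives in.
    let ctx = SessionContext::new();

    // "cpu" and the path are illustrative; in a real deployment the Parquet
    // files would typically sit in object storage rather than on local disk.
    ctx.register_parquet("cpu", "data/cpu.parquet", ParquetReadOptions::default())
        .await?;

    // A plain analytical SQL query over the registered table.
    let df = ctx
        .sql("SELECT host, avg(usage) AS avg_usage FROM cpu GROUP BY host")
        .await?;
    df.show().await?;
    Ok(())
}
```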
And we're like, at that point in
the summer of 2020 when we decided
to build around these things,
you know, DataFusion wasn't even close to as far along as it is now.
And we knew, um, we knew that we'd be
like investing significant effort.
Like we actually looked at using
other engines as the core of the
database.
We looked at some C++ engines
just to think like, you know,
we didn't want to write our own,
but we needed something that was
like optimized for our use case
for time series.
And we saw that regardless of what we
picked up, we would end up having to
do a lot of work and we'd almost,
you know, have to take partial
ownership of the project.
And we thought this umbrella of projects under Apache, under the Apache Foundation, under Arrow, were, you know, early, but they were promising.
And if we really put our effort
behind it, it would cause more
people to also start programing
against it and whatever.
So in November of 2020,
I announced, like, hey, working
on a new core of the database.
It's called IOx, because nobody was comfortable with me calling it InfluxDB 3.0. Because again, they're like, you're not going to rewrite the database, so there's no way we're doing this. And I was like, well, let's just see how it goes.
Um, so yeah, at that point I announced it, and still it was like three of us working on it. It was still very, very early stage, but that got, you know, a number of people out there in the world interested in the project, and we hired some great people that joined us, so that by March of 2021, you know, we had a team of, I think, nine people. And we spent years writing a database, which we launched into production earlier this year.
So how long did it take before
you were allowed to call it 3.0?
Before people stopped telling you, you know, you can't rewrite this database?
Uh, I mean, we announced publicly that it was InfluxDB 3.0 on April 26th of this year.
So inside the company, it took you
like 2 or 3 years to get everybody on
board with the idea of no, really,
we did. And we're about to finish it.
I would say, basically, by 2021, like summer or fall of 2021, people within the company were like, okay, we need this new core database engine. Because at that stage, it was obvious what the limitations of the previous engine were.
Right in the beginning,
people were like, Paul,
what are you talking about?
Like, you know, there were some
people who got it intrinsically
and other people were like,
we don't need to do this right now.
And then by again,
like I'd say like the fall of 2021,
everybody in the company was like,
okay, we definitely need this
new database engine.
When's it going to be ready?
And I'm like, guys, you know, we're not baking a quiche here.
Like it's going to take some time.
Um, yeah.
So, uh, yeah. By, I'd say, the spring of 2022, it was like, okay, this is obviously what we need to be doing. And then definitely by the fall of 2022, it's like, okay, we're getting everybody focused on this new database engine and we're going to call it 3.0. And then it just became a question of when we were going to be more public about the fact that it was 3.0. But in my mind it was always InfluxDB 3.0, even though it is a total rewrite and the database architecture is drastically different from the underlying database architecture of one and two. All right.
Nice, I love that. We just ask a question and then set you loose, and then you just go for it. Sorry. Yeah, go for it. No, don't be sorry. That's the good part. It's fine. Well, I don't know if there should be more back and forth. No, no, no, that was absolutely perfect.
And you know, you kind of answered my second question, which is good as well, because when we're thinking through the problem space here, I think there's a lot of context about what happened in the industry, what happened with the early versions of InfluxDB. Why this rewrite was required, why Rust, why all these things, now makes a lot more sense, right? But, you know, let's reemphasize one of those things you talked about with the 2010s.
It's like this was the decade where
containers and cloud took off.
All right.
People were using ephemeral compute,
spinning up VMs on GCP as they're
launching dozens to hundreds to
thousands of containers
orchestrated with Kubernetes.
And all of these have their own
signals.
They have their own logs,
they have their own metrics,
they have their own traces.
Traces are now important because at
the same time of this wild cloud
container orchestration evolution,
people decided to start doing
microservices because the
technology enabled all these new
architectures too.
So I spend a lot of my time,
I'm just going to set context and
not ask questions and then just
let you infer the questions.
But still, I spend a lot of my
time working with companies that
are trying to build out platforms.
You know, they want to make it
easier for their developers to
deploy to production.
And I think one of the
challenges I've seen is that
people really struggle with,
I need a database for logs, I need a database for tracing, I need a database for my metrics.
I need to be able to aggregate
and query and all this stuff,
and they make it really complicated
for what is, in essence,
all the same data structure.
I don't think there's that much difference between a metric, a trace, and a log. It's all a collection of events. The difference between a trace and a metric is that a metric is an aggregation of some raw-level event, and the challenge has always been the cost of storing it at super high dimensionality versus the ease of querying it. Which is why we probably do metrics and terrible versions of histograms and all this other stuff that we now accept as the norm. Right? Yeah.
And I'm going to quote something on your website that you might hate me for, right. But when we talk about IOx and InfluxDB 3, you specifically say infinite cardinality. So the question is: does IOx, or InfluxDB 3, sorry, give us a single store for all of these signals, all of this observability and monitoring data? And can you give us a bit more insight into what that infinite cardinality actually means in practical terms? Yeah. So.
To store the data? Yes, 3.0 can store all of that kind of data, right? Um, because of the fact that cardinality doesn't matter. The problem becomes what happens when you try to query that data, pulling it out. Right. And that's basically the patterns for how you query the data and what people expect, or, I think, why you end up with three different systems for storing each of those kinds of data, right? Traces, metrics, and logs.
In my mind, they're all just like
different views of the same thing.
Ultimately, like if you wanted to,
you could just have traces and
skip the logs and metrics. Right?
You could infer, you could derive everything else from raw traces, because, again, you could have just a blob string field in the trace that has log info. Right. But the problem is, the problems are, what happens at scale, right?
And when you start generating a
ton of this data,
do you end up having to sample it.
And then how do you actually
access the data.
What are the access patterns. Right.
So for right now, you know, the metric access patterns are like, I have this metric and I want to look at it. And the idea behind metrics is, actually, metrics are a summary view of some underlying distribution or some underlying thing that you're tracking.
Generally, metrics are not the
raw high precision view, right?
For example, if you want the
average response time in one minute
intervals to an API call, right,
a specific like API endpoint,
right now the raw view is
individual requests, right.
And you log every bit of detail you would want on that request. Right. What host received the request, what user submitted it, what token they were using, what endpoint, the actual data included in the request, the response time, the response itself. Right. You can get down to just an insane level of precision. But the problem is, to do that at scale is completely unreasonable. It generates more data than you could ever even store, and all this other stuff. So you end up creating these systems to summarize things.
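A tiny Rust sketch of the summarization just described, collapsing raw per-request events into one-minute average response times per endpoint. The field names are illustrative assumptions, not any particular system's schema:

```rust
use std::collections::BTreeMap;

// A raw, high-precision event: one API request (illustrative fields only).
struct Request {
    timestamp_ms: i64,
    endpoint: String,
    response_time_ms: f64,
}

// Summarize raw requests into a metric: average response time per endpoint per
// one-minute bucket. Detail not captured here (user, token, payload, and so on)
// is gone once only the summary is stored.
fn one_minute_averages(requests: &[Request]) -> BTreeMap<(String, i64), f64> {
    let mut sums: BTreeMap<(String, i64), (f64, u64)> = BTreeMap::new();
    for r in requests {
        let minute = r.timestamp_ms / 60_000;
        let e = sums.entry((r.endpoint.clone(), minute)).or_insert((0.0, 0));
        e.0 += r.response_time_ms;
        e.1 += 1;
    }
    sums.into_iter()
        .map(|(key, (sum, count))| (key, sum / count as f64))
        .collect()
}

fn main() {
    let reqs = vec![Request { timestamp_ms: 0, endpoint: "/write".into(), response_time_ms: 12.5 }];
    for ((endpoint, minute), avg) in one_minute_averages(&reqs) {
        println!("{endpoint} minute {minute}: avg {avg} ms");
    }
}
```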
And the problem that people frequently run into is, well, if you didn't think ahead of time about what you needed to summarize, when you go to look at the summaries, your metrics, the answer you need isn't there and the request has already happened. Right. So logs are a way to capture more detail and then kind of try to figure it out after the fact. Right. So the idea with logs is, it's something where you're doing an ad hoc investigation, where you're not continuously looking for some signal that triggers, like, a problem in the system. But ultimately, for storing all that data, you can use the same format. Parquet, for example, as a storage format can store all that kind of data, but querying it effectively and efficiently is difficult.
And that usually requires either
secondary indexing structures or
other ways to organize the data
so that you can actually query
it effectively.
Now, aspirationally, 3.0 wants to be the home for all that kind of data. Basically, for any kind of observational data you can think of. Right.
And like for us, it's not just like,
you know, the server infrastructure
monitoring use cases, but also more
and more sensor data use cases.
And again, with sensors you find
the same kind of thing, right.
People can deploy more sensors
in their environment for the
machines they're tracking or the
environments they're tracking,
but they can also increase the precision of the measurements. Right?
They can increase the sampling
interval.
They can increase the precision in
terms of what gets tracked with each
measurement that gets created. Right.
And that's all the metadata that
you could potentially track
about something with sensors.
You could be, again,
like all the stuff around the
customers or the users.
It could be around the location,
the, you know, the lat long,
like all that other kind of stuff.
So getting there, like we're there in
terms of being able to store it,
we're not there in terms of
being able to query it.
We organize data into large chunks,
and then the query engine just
kind of brute forces those chunks.
So I think that is going to take some time, to get to the stage where we can actually do all those things.
I think, uh,
there's some other stuff around,
like with logs and tracing use
cases where the schemas are very,
very dynamic and they're not
always consistent. Right.
If you try to like, pull out
structured fields from these things,
a lot of times people won't have the
same field types for something that's
named the same thing because they're
in different services or whatever.
So those are all just kind of weird, they're fun, yeah, horrible infrastructure problems that you just end up having to deal with.
Um, so whereas right now, I would say with 3.0, it's better for structured events, right, where you have events that you're tracking and you want high-precision data, where you can slice and dice it any way you choose. Right. So use cases where you'd think that'd be good are, like, if you're doing usage tracking, API audit logging, any type of individual events. Metrics is also a use case, obviously, that InfluxDB is used for, and this engine is useful for as well.
Um, but logs and traces are a bit
trickier because of the again,
like kind of how flexible the
schemas are as people deploy
them in different systems.
And the thing with tracing that's weird is, I think most of the tracing front ends look like you're looking at a metric view or a log view, and it's like, oh, go look at the trace. So basically what you're doing is you're jumping off to look for a trace by a trace ID, right. Which essentially implies what you want is an index that maps an ID to an individual trace. And of course, a time series database isn't really designed to do that. Or, the way our database is structured is not really designed to do that.
Now, there are ways, ideas we have, for being able to layer that in without having to create super expensive secondary indexing structures, but all of that stuff is going to take time. So I think with tracing, what would make it easier is: when you have a trace ID, you basically always have a time range as well. So for a system that stores traces at scale, it would be easier to say, oh, give me this trace ID, but this is the time envelope that it appears in. At least that's what I'm imagining would make it easier. But yeah.
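A hedged illustration of that "trace ID plus time envelope" idea, just the shape of the query rather than any real InfluxDB API; the table and column names are invented:

```rust
// Constraining a trace-ID lookup to a time envelope means the engine only has
// to brute-force the chunks/partitions that overlap that range, instead of
// scanning everything for one ID.
fn trace_lookup_sql(trace_id: &str, start_ns: i64, end_ns: i64) -> String {
    format!(
        "SELECT * FROM spans \
         WHERE trace_id = '{trace_id}' \
           AND time >= {start_ns} AND time < {end_ns} \
         ORDER BY time"
    )
}

fn main() {
    println!("{}", trace_lookup_sql("abc123", 1_700_000_000_000_000_000, 1_700_000_060_000_000_000));
}
```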
Uh, I don't know. Yeah.
Metrics, logs and traces is basically
like the gold standard for how
you do observability right now.
But I don't think.
I don't think that's the end state.
I think it's not ideal,
like the usability isn't ideal.
It can be painful.
Tracing is expensive from a development perspective, in terms of putting it into your code. But it's also expensive in terms of infrastructure and being able to collect all this data. And then you get into figuring out, okay, do we need to do sampling because there's too much? And it's all still too difficult to use, which tells me there's a lot of room for innovation.
But the thing is,
there are a lot of really smart
people trying to fix these problems.
It's just really hard because the
like, the volume of data keeps
increasing and the demands of the
user base also keep increasing.
Right.
Yeah, I used to work at a logging company, back around the same time you were discussing, how do we move this off to other things? Um, and I remember that being part of the discussion, that we were handling petabytes of data, and how do you handle just that much? Um, but I'd find it interesting whether you end up with a lot of legacy systems as well, because that was always the issue with logging: going back and changing it from the string, where who knows what the developer decided to put in there, if they decided to put anything in there at all. And then there's all these different levels of logs, and they're all strings that you have to figure out how to parse.
Then eventually people move to
structured logging,
but that wasn't a thing.
So that's a whole nother can of
worms to open, I imagine.
Yeah, I mean, then you have to either parse the legacy logs into some sort of structure, right? I mean, and again, the structure is really about query performance. Right. Because you could create the structure on the fly. Right? You could just store all the logs and whatever, and basically, at query time, parse each log line into the structure. And again, you could do this on the fly and be like, oh, well, that query didn't work, but I got this error back, so I'll change the way I'm trying to parse that particular line, or whatever.
But the problem is that those queries
are super expensive to run, right?
They cost a lot of CPU,
they cost a lot of network bandwidth.
So really, the thing about structured logs, and even metrics, is essentially about introducing structure and summarizations that make the queries efficient enough to run for whatever the use case is.
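A minimal sketch of the "create the structure on the fly" idea mentioned above, pulling key=value fields out of an unstructured line at query time; the log format is made up, and doing this per line at query time is exactly the CPU cost being described:

```rust
use std::collections::HashMap;

// Parse "key=value" pairs out of an otherwise unstructured log line at query
// time, instead of requiring structured fields at ingest time.
fn parse_on_the_fly(line: &str) -> HashMap<&str, &str> {
    line.split_whitespace()
        .filter_map(|token| token.split_once('='))
        .collect()
}

fn main() {
    let line = "level=error service=auth latency_ms=231 msg=timeout";
    let fields = parse_on_the_fly(line);
    // e.g. keep only logs where level=error, as a query engine might at runtime.
    if fields.get("level") == Some(&"error") {
        println!("matched: {fields:?}");
    }
}
```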
And I think that's also one of the things to keep in mind: is this a case where you're building for a system that does automation, where it needs the query to return in, you know, tens of milliseconds, hundreds of milliseconds? And, you know, with automated systems, generally those queries run all the time, right? Many, many, many times. Right.
Or is it an ad hoc query from a user who's doing some investigation, in which case they're more than willing to wait tens of seconds for a return? And those queries are few and far between, right? You're not doing that many of them. And I think the ad hoc thing is probably easier to solve with that pattern of: let's take these blobs of data, put them into object storage, and spin up compute on the fly to handle those. Um, it's the real-time systems where you have to really think about layering in structure and optimizations, so that you can answer the queries fast enough. All right.
So you mentioned at the start how you had InfluxQL in version one, this SQL-like language that frustrated people because it wasn't the real SQL they were familiar with from working with other databases. With DataFusion, do people get access to more traditional SQL? That's part one of the question. Part two is: with InfluxDB 2, there was a lot of investment into the Flux language, with the messaging around how Flux was purpose-built for time series in a way that SQL wasn't. I'm curious if that has changed. Do we feel now that SQL is the right language for time series, or is there still a future for Flux with InfluxDB 3? Yeah.
So first I'll mention something about InfluxQL. Some people found it frustrating because it was only SQL-like. But a surprising number of people have told us they actually prefer InfluxQL to SQL for writing some basic time series queries. Right? Because it's just super easy to write a thing where you're, you know, getting a summarization by these different things. Whereas in a SQL engine, you might have to deal with windowing functions and partitioning and all this other stuff.
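A rough side-by-side of that ergonomic difference. The InfluxQL line is the classic downsampling pattern; the SQL version is one plausible equivalent, and exact function names such as date_bin depend on the SQL engine:

```rust
// The InfluxQL form: mean usage_idle per host, in 1-minute windows.
const INFLUXQL: &str =
    "SELECT MEAN(usage_idle) FROM cpu WHERE time > now() - 1h GROUP BY time(1m), host";

// One plausible SQL equivalent (function names vary by engine).
const SQL: &str =
    "SELECT date_bin(INTERVAL '1 minute', time) AS minute, host, avg(usage_idle) \
     FROM cpu WHERE time > now() - INTERVAL '1 hour' GROUP BY minute, host";

fn main() {
    println!("InfluxQL:\n{INFLUXQL}\n\nSQL:\n{SQL}");
}
```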
So I think InfluxQL has a place, that it will continue to serve, just because we've gotten that feedback that a number of people actually prefer it to SQL.
So with DataFusion, we have a fully featured SQL engine that supports all of these things: joins, window functions, partitioning, all this complex SQL stuff. And we've been adding in more and more functions to make it more capable of doing time-series-specific style queries.
And we'll continue to do that.
We're certainly not done with that.
Um, so the thing with 3.0 is, again, it's built around this DataFusion query engine, which is a SQL engine. Now, all that's written in Rust; obviously, 2.0 is written in Go. And Flux as a language was something we developed for 2.0, and we developed it for a couple of reasons. One, the users of InfluxDB frequently had requests where they just wanted to express more arbitrary and complex logic in their queries than could be expressed in SQL or InfluxQL.
So we were just like, we essentially need a scripting language paired with a query language, so that people could do more complex things.
And it became like, oh,
you can use this also to connect
to third party systems,
to join with your time series data
on the fly, or to send data out
to third party systems, right?
Be it another database or a third
party API or whatever. Right?
So that was part of it.
Another part of it was, I had a thesis back in, whatever, I mean, originally in the fall of 2014, I was talking about changing the language from InfluxQL to potentially something that looked more like a functional language, which is what Flux is. And I decided at that time to stay with InfluxQL. But I had a theory that, oh, I think for the time series use case, a functional language would be better. It would be more expressive and more powerful for working with time series data. And Flux was our attempt at that. Right. And so we built, obviously, the Flux language, the scripting language, the query engine, the planner, the optimizer, everything from the ground up, which is basically two very large separate projects. Um, and all of that is written in Go. Now, coming to 3.0,
we thought, okay, we need a way to bridge Flux. And we also wanted to see if we could bring over InfluxQL. And with InfluxQL we had the idea:
So what we can probably do is write a
parser in rust to parse the language,
which isn't too hard to create.
That will then translate an SQL
query into a data fusion query plan.
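An illustrative toy of that bridge idea only. The real implementation parses InfluxQL into an AST and builds a DataFusion logical plan; this just shows the shape of the mapping for one narrow query pattern, and the function and names are invented:

```rust
// Toy translation for one narrow pattern:
//   InfluxQL: SELECT mean(<field>) FROM <measurement> GROUP BY time(<window>)
// becomes a SQL string that a DataFusion SessionContext could plan and run.
fn influxql_mean_to_sql(measurement: &str, field: &str, window: &str) -> String {
    format!(
        "SELECT date_bin(INTERVAL '{window}', time) AS time, avg({field}) AS mean \
         FROM {measurement} GROUP BY 1 ORDER BY 1"
    )
}

fn main() {
    println!("{}", influxql_mean_to_sql("cpu", "usage_idle", "1 minute"));
}
```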
So we had one person start on
that in the summer of last year.
And basically now, this year, we have that, and it actually works. And it works really, really well. Right. And it was basically just one person who did a lot of that effort last fall. We added additional people onto the team to build the last bit, which is the API bridge, to represent all of that with the InfluxDB 1 query API.
So we're really happy we were able to bring that over, and it gets all the benefits of that DataFusion query engine, right? So when there are performance optimizations or other things like that, when we pull those in, InfluxQL gets all of that for free.
Now, with Flux, the surface area of it is way, way too large to try and create it again in Rust. Although I would love to do that, it's just way, way too big of a project, and we don't have the time or resources to do it. So what we did was: in our Cloud 2 platform, we had Flux processes that communicated with the old TSM storage engine via this gRPC API.
So in 3.0 we created that gRPC API, and we connected Flux up to it. And what we found, through some production mirroring and actually letting customers test it, was, one, that gRPC API kind of had some edge cases that were poorly specified. So there were weird, unforeseen bugs that would surface in the Flux queries, things that worked on the previous one that don't work on this. But more importantly, the performance of that bridge is not good, right?
Basically, there were queries that would work in the old Flux-to-TSM version in a decent amount of time, and those same queries on the Flux-to-3.0 bridge just timed out.
And again, like one of the things
we were trying to do with 3.0,
one of the reasons we adopted
this new query engine is because
we wanted queries to be super,
super fast, right?
Query performance was a super important thing, and with a lot of Flux queries, we saw there were queries that just wouldn't work in that engine. Because when we created this bridge, because of the way that API works, it's literally built around how data is organized in that TSM storage engine. And the 3.0 engine does not have the same organization. So in order to present the data in that organization, the 3.0 engine has to do a lot of post hoc sorting and filtering, and that sorting basically chews up CPU time. And basically, it wasn't performant.
So right now we have a theory that there would be a way to update the Flux engine so that it uses the 3.0 native API, which is basically Flight SQL, and that the Flux engine could do the work itself to reorder the data the way it expects it to be ordered, and maybe that would be performant enough.
But for the time being,
we're not focused on that.
We're still focused on the core query engine, which means InfluxQL and SQL, and adding capabilities and performance to it. Right now, it's already faster than 1.0 at a number of queries, but there are still some queries that it's definitely not faster with, and we want to improve those situations and spend our time on that for now. And then see later what we can do with Flux.
People in the community have expressed interest in actually self-organizing to do that work. So we've actually created a separate community fork of Flux that we're going to be pointing people to. And that fork will be a place where people can collaborate on this idea. The thing is, we can't do it in our primary branch, because we run this in production, in our cloud environment.
It's just too difficult to try and pull these changes in. We need to give people the ability to iterate on their own without having to go through our production pipeline. So that's the idea: in 3.0, InfluxQL and SQL are native and supported, and with Flux we're still trying to figure out what we do. I mean, I will say a separate thing about Flux,
which is maybe not obvious to people,
or for some people, maybe it is,
but the language is highly
polarizing, right?
It's a new language, and a lot
of people are like, I don't want
to learn your stupid language.
And I get that, um, I did not
really get that six years ago,
but I get it now.
Um, and so basically what we found is that a lot of people just didn't want to pick up the language. They wanted to work with something they already know. And again, with InfluxQL, it's a different language, but the thing is, it feels like an old friend. You know it, so you can pick it up without having to do too many things. But with Flux, it was a serious adoption blocker for a lot of people.
But then on the other side of this, there is a slice of people who took the time to learn Flux, and they absolutely love it, because they can do things in that language that they could not do in SQL or InfluxQL. And I think that is kind of a testament to the reason why we built the language, because we wanted something that enabled arbitrary processing inside the core of the database.
So again, it's one of those things where, depending on who you are and how you look at it, Flux is either a great thing and we need to keep pushing it forward, or it's, why did these guys build this language? This doesn't make any sense. It's tricky. All right.
Awesome. Thank you.
Did you have a question, Laura?
No, no, just random commentary. Yeah.
We're running out of time very fast here, so let's finish up with something a bit different. Um, recently, HashiCorp announced a license change to the BUSL, the Business Source License, and you posted some thoughts on Twitter saying that it gave you a bit of pause for thought on what source available and open source are, or their future. And I'd love for you just to talk about that, and maybe even bring it into the context of what's the future for the license on InfluxDB 3. Yeah.
So, uh, basically, in my mind, open source, the BUSL, the community licenses, all those things, those are not open source. Those are basically the new version of shareware, or commercial freemium software, right? It's commercial software.
And frequently they will offer that software to you to use for free under certain conditions. And if you are one of the people who meet those conditions, maybe you're happy and you'll be able to use it, right? Tons of people continue to use MongoDB, the SSPL version. Tons of people still use Elastic after they changed their license. Redis, Confluent, like, you know, literally every single infrastructure open source creator has changed their license over the last six years. There isn't one I can think of who hasn't. Except for us, maybe.
So, uh, and I totally get the motivations to do that. But in my mind, with open source... I actually don't like copyleft licenses, like GPL or AGPL; I don't consider them to be real open source. To me, open source is really about freedom. Freedom to do what I want with that code. And if you put any sort of restriction on it, which a copyleft license does, then that's, you know, restricting my freedom. To me, open source is about freedom. So.
And again, I think for a company that's producing open source code, you have to be okay with the fact that people are going to do anything with that code, up to and including competing with you. And if you are not okay with that, you should not be putting that code out as open source, because that is what's going to happen. Right?
And generally the best thing for
a company to do that's building
products and stuff like that is
to only put open source code in
something that they wish to have
become a commodity. Right?
So the operating system, the server operating system that your servers run on, you want that to be a commodity. Many times you want the database to be a commodity. You want these core infrastructure components, the ones that are essentially not part of the value that you deliver to your customers, to be commodities, so you don't have to pay a lot extra for those things, right? You want the price of those things to be driven down as low as possible.
But if you are a vendor,
the thing you sell, you don't
want that to be a commodity.
You want to be able to sell it
for the highest possible price.
So the problem that vendors have, the ones that create a project where their primary monetization path is essentially that project plus something or whatever, is that it becomes, well, we're putting all this effort into the open source thing, and a bunch of people are using it for free, and there are a bunch of freeloaders, and there are a bunch of competitors who are taking our stuff, and then they decide to change the license. And the problem I have, well, there are multiple pieces to this.
One is as the creator of a project,
if I want it to get the broadest
possible adoption,
I want more people to use it.
I'm incentivized to have that project
be permissively licensed. Right?
All else being equal, a developer or a user looking at a piece of software, literally, if everything else is the same and the difference is a commercial license or a permissive Apache or MIT license, they're going to pick the permissive one, right? There's no reason not to, because you get a bunch of other stuff. So they choose that.
But the problem now that I think
the HashiCorp thing kind of
highlights is it's just yet
another vendor in a long list of
vendors over the last six years
who've changed their licenses.
And now, basically, it's causing a lot of distrust in the developer community, because they see, oh, here's a new open source project, and they look: is that project by a VC-backed company, right, a VC-backed startup? If it is, they're going to be like, well, it's an open source project, yeah, but I don't believe that it's going to continue to be open source. Which is a totally valid thing to think, given how things have been going over the last six years.
So, you know, previously, again, my thesis was: if you want the broadest possible adoption, put it under a permissive license. That idea is kind of getting damaged by the fact that people keep changing their license from permissive to something commercial, right?
And I don't know if there's a
solution to that.
I do think, separately, that HashiCorp probably made an error here. Like, the thing is, the license change only protects forward commits, right? You can't retroactively change a license. So they could have just as easily put all their development effort into a closed source private fork, and then made the open source piece a downstream, you know, dependency of that closed source fork, and then not told anybody about it, right? Just be like, yeah, we're just going to do this or whatever.
Or, conversely, they could have just said, we're going to donate this thing to a foundation, but all of our developers are going to be working on this closed source thing, right? The effect would have been the same in terms of the end outcome. Right. Because you have this open Terraform fork. But the difference in everybody's perception of HashiCorp as a company would have been dramatically different. Right? The business and commercial effect would have been exactly the same, which is: all their R&D, all their development effort, is going only into the commercial software. Which is fine. That is their right.
If they want to do that,
they should totally do that.
And it's also fine if they want
to change their license.
That is totally permissible, right? Just because you create something open source one time doesn't mean you owe it to the world to continue, for the rest of your life, to put more and more into it. Right.
Um, but I think there were more
graceful ways to handle it that
could have delivered the same
business outcome for them,
at the very least. All right.
Let's extend that question by one sentence and just say: given the fear that people have now, and rightly so, about single-vendor, VC-backed open source projects not always staying open source, does that mean that InfluxDB 3 could be an Apache Foundation project in the future?
Or are you not worried about the
perception or the fear of people
adopting it because it's a VC
backed single vendor project?
So I do not think we will put InfluxDB 3 into a foundation. My goal is to have a permissively licensed project. Um, but the truth is, the problem with foundation projects is the bar is usually too high. Like, InfluxDB doesn't meet the bar for most foundations, right? There aren't multiple companies contributing to InfluxDB, you know, version one, two, or three, right? We're the ones developing it alone, so it doesn't really hit the level of a foundation. Terraform, you certainly could have put into a foundation. I'm certain a number of foundations would have taken it gladly as a project, as a top level project. But InfluxDB doesn't hit that. It's not on the same level in terms of the contribution.
um, there's that, uh, I mean,
I think, I don't know,
like for broadly for the,
for the community and building trust
with people and stuff like that.
I don't know if there's a solution.
Right.
I think the license itself matters
because if it's Apache two or MIT,
then, you know,
people can do whatever they want
with that software for all time
as long as for that point.
But maybe there's a model where
you can commit to, you know,
transitioning into open governance
over some period of time.
I think early on in a project,
open governance is actually more
of a hindrance than.
You know, a benefit because you
want like a small,
tight knit group of people that
are driving the project forward.
Um, but yeah, I don't know.
It's tricky.
There are so many things I want
to say and argue about here,
but no, we're out of time.
Um, I think the one thing I guess I will say, the one little challenge, and I'll just kind of leave it here, I guess: Red Hat tried the "we're going to switch the order of things and make the downstream the open source part", and I would argue that the community really doesn't like that. So I don't know, there are a lot of different ways. I admit that I am a huge fan of the AGPLv3, but maybe that's why I would argue it. But we unfortunately don't have time, and I want to, really badly.
To be continued.
To be continued. That's correct.
Yeah, we could have an entire session on open source licensing. I could talk about that for, like, hours. It would be so much fun. Anyway.
All right. Yeah, well, we don't want to keep you any longer, so feel free to use 30 seconds if you wish, just to tell people where they can learn more about Influx, follow you on Twitter, anything like that. Feel free to shamelessly plug it if you wish.
On Twitter I'm @pauldix. For InfluxDB, you can find us at influxdata.com or influxdb.com. 3.0 we have available as a multi-tenant cloud product and as a dedicated cloud product, and soon open source builds of 3.0 will be available. But again, we're focused on our commercial offerings at the moment, for obvious reasons. But yeah.
All right. Thank you very much.
Thanks for joining us.
If you want to keep up with us,
consider subscribing to the podcast
on your favorite podcast app, or
even go to cloud native compass FM.
And if you want us to talk with
someone specific or cover a
specific topic, reach out to us
on any social media platform.
Until next time, when exploring the cloud native landscape, on three. One, two, three: Don't forget your compass. Don't forget your compass.